2+3) # missing opening parenthesis
2 + * 3 # missing number between the addition and the multiplication operators
0,5 # use dot instead of comma to denote one halfBoerhaave 2025: Using R for data analysis
R and RStudio basics
Console as a calculator (lecture) 👨🏫
Gives: “arithmetic”, “callingFunctions”, “help”, “console”, “prompt”, “comment”, sqrt, log10, exp, sin, sum.
Installation
Go to RStudio download page.
Follow RStudio Desktop (and R) installation steps and once installed, start the RStudio program.
Global options
Important preparation of settings in RStudio:
- Go to the “Tools” menu and select “Global Options”. A window with “Options” will appear.
- Select “General” section, “Basic” panel and look at the “Workspace” part.
- Disable option “Restore .RData into workspace at startup”.
- At “Save workspace to .RData on exit” select “Never”.
- Finally, press “Apply”.
Console, expressions
Your screen will be divided into several parts (panes).
In the console pane you will see the > sign, called the prompt.
After the prompt you can type expressions in the R language.
For example, type there a mathematical expression: 2+2 and press Enter.
You should see the result: 4.
R performs calculations according to the rules of arithmetic operations.
The order of operations is:
- parentheses
(...) - exponentiation
^ - multiplication
*and division/ - addition
+and subtraction-
Use . (dot) to separate the integer part from the decimal part of a number: for example 3.5.
You may also use the exponent notation: 35e-1 is the same as 3.5 (e-1 denotes 10 to the power of -1).
Note that the # sign is used to start a comment in R.
Everything after the # sign till the end of the line is ignored by R.
Spaces are also (mostly) ignored by R. You can use them to make your code more readable
Errros
Observe how R reports errors. Type the following lines:
Multiline expressions
When Enter is pressed, but an expression is unfinished, R will show a + sign instead of the > prompt. It means that R is waiting for the rest of the expression. Either type the rest of the expression or press Esc (or Ctrl+C) to cancel the expression. Try the following:
3 +
4
(((((
1
)))))Calling functions
More complex operations are provided by functions. For example, the sqrt function calculates the square root of a number. Type sqrt(9) and press Enter.
Call the following functions with the arguments given below:
Getting help
In the RStudio window find another pane with the Help tab. Type sqrt in the search box and press Enter. You will see the help page for the sqrt function. The help page contains the description of the function, its arguments, and examples of usage. To get the help, you may also type ?sqrt in the console and press Enter.
The internet usually provides more up-to-date information. Try to search for R sqrt in your favorite search engine.
Numbers and arithmetic (exercise) 📖
Needs: “arithmetic”, “callingFunctions”, “help”, “console”, “comment”, sqrt, log10, exp, sin, sum.
Gives: round, pi.
Practices: “arithmetic”, “help”, “callingFunctions”, “console”, “comment”, sin, round, pi.
Arithmetic operations and their order
In the console, one-by-one type the expressions given below. Each time before you press Enter, think what should be the result. Are the results as expected?
3*4+5
3+4*5
(3+4)*5
3-2
3--2
3-(-2)
-2--2
8/2
8/-2
8/2*2
8/(2*2)
2^5Real numbers, zero
Check the following expressions (think about the results before you press Enter):
Rounding numbers
Have a look at the following expressions (and check the manual for the round function):
RStudio projects (lecture) 👩🏫
Gives: “RStudioProjects”.
A typical data analysis involves multiple files:
-
Input files with data, fixed before analysis
(e.g. files in various formats containing tables of data to be analysed). -
Script files or report files
(there you develop programs in the R language which analyse the input data and report results).
RStudio uses the concept of a project to organize files of one analysis and separate them from other analyses.
On a disk drive, a project is a directory containing the files of the analysis.
When you open a project, RStudio remembers the files you were working on and the settings you used.
Creating a new project
Follow the subsequent steps to create a new (example) project Learning_R.
From the RStudio menu choose: menu File, New project... and then New directory... and New project....
In the final dialog box:
- As the new directory name type:
Learning_R. - In the next option you may select where to create this directory.
- Press
Create projectto finalize the process.
Check whether you use the new project
Refer to the following Figure:
- Look at the top right corner of the RStudio window to see the name of the current project.
Learning_Rshould be the current project. - The title line shows the project diectory. This is the current directory and there the files of the project are/will be kept.
- Note the project file
Learning_R.Rprojin theFilespanel. Later, when you find this file in a file explorer, once clicked it should start RStudio and open this project.
Copy input data files to the project directory
Although R can be used to analyse data from files located anywhere in the filesystem, it is convenient to have such files in the project directory.
For the following parts of the course you need the file(s): pulseNA.csv.
Download (with a browser) or copy (with the system file explorer) these files to the project directory. Note, that the details of downloading/copying depend on the operating system of your computer.
Once the files are present in the project directory, they should be visible in the Files panel (like the example files pulse.csv and survey.csv below).
R scripts and R Markdown documents (lecture) 👩🏫
Needs: “console”, “RStudioProjects”.
Gives: “RScript”, “RMarkdown”.
What is an R script
R script is a simple text file (with extension .R, for example my_script.R) containing expressions written in the R language. When an R script is sourced, the lines of the script are one-by-one copied to the Console and immediately executed. An R script allows to keep R commands saved in a file (without saving, the commands given in the Console are executed and forgotten).
What is an R Markdown document
An R Markdown document (with extension .Rmd, for example my_report.Rmd) is an extension of an R script. It is a format for developing elegant and reproducible reports. It is a simple text file consisting of:
- Chunks written in R language, where data analysis is performed.
- Free text written in Markdown notation, possibly containing: sections, lists, tables, hyperlinks, formulae: \(r=\sqrt{x^2+y^2}\), etc.
The Knit operation converts an R Markdown document into a report file. Many report output formats are available: HTML, PDF (printable pages or presentations), Microsoft Word documents, etc.
(For example: this page is written in R Markdown and knitted to HTML)
Follow these links to find more:
Create a new R Markdown document
Note: Before creating a new document, first check whether you work in the Learning_R project.
From the RStudio menu choose File/New File/R Markdown...
In the dialog box you may specify some details of the analysis (later change is possible):
- Title of the analysis.
(Note: this is the title of your future report, and not the name of the file where the document gets saved.) - Name of the author.
- Output format (for now, select HTML).
Once accepted, a new document Untitled1 is shown.
The new document is not empty; it contains a demo example written in R Markdown.
Knit an R Markdown document
Knitting is the name of the process of conversion from R Markdown text file to a report in one of the target formats (for example: HTML).
- Press the
Knitbutton to start the conversion. - If the R Markdown document does not have a file name yet, a save dialog box will pop up to ask for the file name.
- Some R libraries might get automatically installed before the first
Knit. - The knitting process might take a few seconds and then the final document should appear either in a new window or in the
Viewerpane.
The Figure below shows the result of knitting of the standard example (R Markdown was in file namedLesson_1.Rmdand was knitted to HTML fileLesson_1.html):
Parts of an R Markdown document
An R Markdown document consists of: header, chunks and free text.
The header:
- Is located at the first lines of the file.
- Starts and ends with a line containing
---(three minus symbols). - May specify title, author, date and possibly other properties of the document.
A chunk:
- Is a place where code in R language is written. The result of this code will be displayed below the chunk.
- Starts with a line containing
```{r}(three backticks, thenrin curly braces). - Ends with a line containing
``` - Has a gray background (in RStudio).
A free text:
- Not the header and not a chunk.
- Has a white background (in RStudio).
Try to locate these elements in the picture shown below:
Editing an R Markdown document
Useful keystrokes:
- To insert a new chunk you may use menu
Code/Insert Chunkor the keyboard shortcut shown in the menu. - To run the chunk containing the cursor you may use menu
Code/Run Region/Run Current Chunkor the keyboard shortcut shown in the menu. - To run the line containing the cursor or currently selected text you may use
Ctrl-Enter.
Reading a table from a .csv file (lecture) 👨🏫
Files: pulseNA.csv, pulseNA.csv.gz.
Needs: “RMarkdown”.
Gives: read.csv, “pulseNA.csv”, head, $, str.
The experiment
Students in an introductory statistics class (MS212 taught by Professor John Eccleston and Dr Richard Wilson at The University of Queensland, years 1993-1998) participated in a simple experiment.
The students measured their own pulse rate. They were then asked to flip a coin. If the coin came up heads, they were to run in place for one minute. Otherwise they sat without movement for one minute. Then everyone measured their pulse again.
The pulse rates and other physiological and lifestyle data are given in the data table. A slightly modified version of the table is available in the pulseNA.csv file.
Preparation
Make sure that the file pulseNA.csv is available in the working directory (check pulseNA.csv is shown the Files pane).
Work in a newly created R Markdown document, saved in the same workind directory (check that your current .Rmd file is also shown in the Files pane).
Type R expressions shown below in R code chunks (on gray background) in the R Markdown document.
Use Ctrl-Enter to execute the expressions in the console.
Reading the table
The read.csv function (from the base-R package) reads a table from a file in the comma-separated values (.csv) format. The function returns the table (as a data.frame object).
Note: read.csv and read_csv are two different functions providing similar functionality, and they will be discussed later in more detail.
In the example below, the table gets assigned to the variable d. Then, the 3 top rows of the table from d are extracted with the head function and printed.
id name height weight age gender smokes alcohol exercise ran pulse1
1 1993_A Bonnie 173 57 18 female no yes moderate FALSE 86
2 1993_B Melanie 179 58 19 female no yes moderate TRUE 82
3 1993_C Consuelo 167 62 18 female no yes high TRUE 96
pulse2 year
1 88 1993
2 150 1993
3 176 1993
The table should have more than 10 columns (id, name, …). If you don’t see separate columns, the file with the table might be broken. Try then to load the file named pulseNA.csv.gz (a compressed version of the pulseNA.csv file; R will unpack it automatically). If you still have problems, consult a teacher.
If the table is too wide, then one row of the table might be shown in multiple lines.
Table properties
A quick overview of the table can be obtained with the str function.
Number of rows and columns is shown, as well as names and types of data in the columns.
str(d)'data.frame': 110 obs. of 13 variables:
$ id : chr "1993_A" "1993_B" "1993_C" "1993_D" ...
$ name : chr "Bonnie" "Melanie" "Consuelo" "Travis" ...
$ height : int 173 179 167 195 173 184 162 169 164 168 ...
$ weight : num 57 58 62 84 64 74 57 55 56 60 ...
$ age : int 18 19 18 18 18 22 20 18 19 23 ...
$ gender : chr "female" "female" "female" "male" ...
$ smokes : chr "no" "no" "no" "no" ...
$ alcohol : chr "yes" "yes" "yes" "yes" ...
$ exercise: chr "moderate" "moderate" "high" "high" ...
$ ran : logi FALSE TRUE TRUE FALSE FALSE TRUE ...
$ pulse1 : int 86 82 96 71 90 78 68 71 68 88 ...
$ pulse2 : int 88 150 176 73 88 141 72 77 68 150 ...
$ year : int 1993 1993 1993 1993 1993 1993 1993 1993 1993 1993 ...
Accessing table columns
With the $ operator, you can access a single column of the table.
d$pulse1 # column `pulse1` as a vector of numbers [1] 86 82 96 71 90 78 68 71 68 88 76 74 70 78 69 77 64 80
[19] 83 78 88 70 78 80 68 70 62 81 78 86 59 68 75 74 60 70
[37] 80 58 84 104 66 84 65 80 66 104 76 70 66 92 70 63 65 76
[55] 56 64 60 68 80 65 47 50 80 76 70 76 72 80 76 85 49 76
[73] 145 83 72 NA 60 80 70 68 78 52 74 75 72 80 84 74 90 61
[91] 85 78 76 90 64 64 88 64 82 88 74 88 92 76 71 119 90 86
[109] 69 75
Observe, that you may also use the head function to get first elements of that vector:
head( d$pulse1, 10 ) # first 10 elements of the vector from `pulse1` column [1] 86 82 96 71 90 78 68 71 68 88
Reading a table from a .csv file, new way (lecture) 👨🏫
Files: pulseNA.csv.
Needs: read.csv, “pulseNA.csv”, head.
Gives: “tidyverse”, read_csv, install.packages, library.
The tidyverse package
The read.csv function belongs to the base-R package. It provides functionality as it was described in early years of R development. Functions developed recently are usually more user-friendly and less error-prone. The tidyverse package is a collection of such functions, and this course favors tidyverse functions over base-R functions.
More information about the tidyverse package at tidyverse.org.
Package installation
To install the tidyverse package type once in the console:
install.packages("tidyverse")Some messages will be shown, and possibly some additional packages will be installed.
Reading a table
The following code reads the pulseNA.csv file into the d variable using the read_csv function from the tidyverse package. The function returns the table as an object of type tibble, which is a modern version of the data.frame with many improvements.
Before you can use the read_csv function, you need to load the tidyverse package with the library function. You do it once, usually in a separate chunk at the top of the script.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
d <- read_csv( "pulseNA.csv" ) # NOTE: read_csv not read.csv !!!Rows: 110 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): id, name, gender, smokes, alcohol, exercise
dbl (6): height, weight, age, pulse1, pulse2, year
lgl (1): ran
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head( d, 3 )# A tibble: 3 × 13
id name height weight age gender smokes alcohol exercise ran pulse1
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <lgl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate FALSE 86
2 1993_B Melanie 179 58 19 female no yes moderate TRUE 82
3 1993_C Consue… 167 62 18 female no yes high TRUE 96
# ℹ 2 more variables: pulse2 <dbl>, year <dbl>
Reading a table (silent)
You can ignore the messages (including the conflict ones). For now, you may prefer to use the following lines, which do not show the messages:
suppressPackageStartupMessages( library(tidyverse) ) # once in the top chunk
d <- read_csv( "pulseNA.csv", show_col_types=FALSE )
head( d, 3 )# A tibble: 3 × 13
id name height weight age gender smokes alcohol exercise ran pulse1
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <lgl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate FALSE 86
2 1993_B Melanie 179 58 19 female no yes moderate TRUE 82
3 1993_C Consue… 167 62 18 female no yes high TRUE 96
# ℹ 2 more variables: pulse2 <dbl>, year <dbl>
Table reading practice in a new R session (exercise) 📖
Files: pulseNA.csv.
Needs: read.csv, read_csv, “tidyverse”, head, str.
Gives: print, as_tibble, as.data.frame, tail.
Practices: head, tail, str, print, as_tibble, as.data.frame.
Practice this exercise in a new, separate RStudio session. Create a new R project, download the pulseNA.csv file, and save it in the project directory. Create a new R Markdown script, save it in the project directory. Finally, develop the content of the script as described below.
Read pulseNA.csv twice:
- into the
bdvariable as adata.frameusing theread.csvfunction from base-R, - into the
tdvariable as atibbleusing theread_csvfunction from thetidyversepackage.
Solution…
suppressPackageStartupMessages( library(tidyverse) )
bd <- read.csv( "pulseNA.csv" )
td <- read_csv( "pulseNA.csv" )Both loaded objects represent the same table but they are not identical. Try the following code and observe small differences in output of str, head, and tail functions applied to both tables.
A tibble table by default prints only its top 10 rows - observe, how to use print to see the full table content. Also try to use the as_tibble function to convert the bd data.frame to a tibble and the as.data.frame function to convert the td tibble to a data.frame.
str(bd)
str(td)
head(bd) # top 6 rows
head(td) # top 6 rows
tail(bd) # bottom 6 rows
tail(td) # bottom 6 rows
bd$hei # ERROR-PRONE in data.frame, a column with name starting with "hei"
td$hei # ERROR in tibble, column name must much exactly
print(td)
print(td, n=1000) # print 1000 rows
as_tibble(bd) # convert data.frame to tibble
as.data.frame(td) # convert tibble to data.frameInvestigating a table (exercise) 📖
Files: pulseNA.csv.
Needs: read.csv, “pulseNA.csv”, head, $, str.
Gives: colnames, names, dim, nrow, ncol, head, tail, “[rows,]”, min, max, range, fivenum, unique, table.
Practices: str, head, tail, colnames, names, dim, nrow, ncol, “[rows,]”, min, max, range, fivenum, unique, table.
(Note: This exercise illustrates the first steps of analysis of any new dataset.)
Find the pulseNA.csv file and read it into the d data frame (for this file it makes no difference whether you use the base-R read.csv function or the read_csv function from the tidyverse package).
Solution…
d <- read.csv( "pulseNA.csv" )Investigating table properties
Investigate the structure of the d data frame with str. A lot of information is shown at once.
Let’s try to access some of the information step by step:
- Learn how to use
colnamesandnamesfunctions to get the names of the table columns.
- Learn how to use
dim,nrow,ncolfunctions to get the numbers of rows and columns in the table.
Investigating rows
- Learn how to use the
head/tailfunctions to get the first/last 15 rows of theddata frame. - Get the same first/last rows using the single square bracket operator (hint:
d[10:15,]). Note, there is a comma after the row index range.
Investigating columns
Understand the meaning of the columns. Decide, what type of data each columns represents (numerical, categorical, ordinal).
- For numerical columns: learn how to find the range of values with functions
min,max.
Some columns contain missing data, denoted withNA.
Learn how to omit missing values when calculating the minimum and maximum.
Hint:min(d$pulse, na.rm=TRUE). - For numerical columns: you do not need to find
minandmaxseparately.
Learn how to userangeandfivenumfunctions to get more information about the range of values in a single function call. - For categorical/ordinal columns: learn how to use the
uniquefunction to get the unique values of the column.
- For categorical/ordinal columns: learn how to count the occurrences of each level with the
tablefunction.
Hint:table(d$gender, useNA="always"). - For ordinal columns: what should be the order of the levels (it will need to be defined manually, later).
Solution…
min(d$height) # same with d$weight
max(d$height)
range(d$height)
fivenum(d$height)
min(d$age) # there are missing values in `age` column
min(d$age, na.rm=TRUE)
range(d$age, na.rm=TRUE)
range(d$pulse1) # no missing values
fivenum(d$pulse1)
range(d$pulse2, na.rm=TRUE) # some missing values
fivenum(d$pulse2) # fivenum by default omits missing values
unique(d$gender)
table(d$gender, useNA="always")
unique(d$smokes)
table(d$smokes, useNA="always")
table(d$alcohol, useNA="always")
table(d$exercise, useNA="always")
table(d$ran, useNA="always")
range(d$year) # it looks like numbers
table(d$year) # but should be treated as categorical/ordinal
# ordinal columns:
# - exercise: low moderate high
# - year: 1993 1995 1996 1997 1998Simple histograms, scatterplots, barplots, boxplots (exercise) 📖
Files: pulseNA.csv.
Needs: read.csv, “pulseNA.csv”, $, str, min, max, table, fivenum.
Gives: hist, plot, barplot, boxplot, pairs, mean, sd.
Practices: hist, plot, barplot, boxplot, pairs, min, max, mean, sd, fivenum, table.
(Note: This exercise illustrates the first steps of analysis of any new dataset.)
Find the pulseNA.csv file and read it into the d data frame (for this file it makes no difference whether you use the base-R read.csv function or the read_csv function from the tidyverse package).
Solution…
d <- read.csv( "pulseNA.csv" )Histogram
Pick a numerical column from the table. Find out how to use the hist function to produce a histogram of the column.
Calculate the minimum and maximum values of the column. In a similar way, calculate the mean and the standard deviation (sd). Relate the results to the histogram.
Scatterplot(s)
Find out how to use plot to produce a scatterplot of the dependence of height on weight (horizontal). Then, try the following code and discover its meaning:
Barplot
The table function can be used to count the occurrences of each level in a categorical column. Count the occurrences of each level in the exercise column. Now, find out how to put the result of the calcluation into the barplot function to produce a barplot of the counts of each exercise value. Observe, how the barplot plot corresponds to the results obtained from table.
Repeat the process for the year column.
Boxplot
Study the column weight. Use the fivenum function to get the five-number summary of the column. Then, use the boxplot function to produce a boxplot of the weight column. Observe, how the boxplot corresponds to the five-number summary.
Finally, understand the plot generated by the following code:
boxplot( weight ~ gender, data=d )Data structures 1/2
Variables and assignment (lecture) 👨🏫
Needs: “arithmetic”, “console”, “comment”.
Gives: “<-”, “=”, “assignment”.
Variable and its value
A variable stores a value, which can be explicitly provided or a result of a calculation.
Type the following lines in the Console. Observe changes in the Environment pane after each Enter.
gardenWidth <- 5 # Assign 5 to the variable gardenWidth
gardenLength <- 6 # Assign 6 to the variable gardenLength
gardenArea <- gardenWidth * gardenLength # Calculate the area of the garden
# # Assign the result to the variable gardenAreaTo find out what the value of a variable is, type its name:
gardenArea[1] 30
A frequent error is to ask for a non-existing variable. Type to see the error:
gggardenAreaError: object 'gggardenArea' not found
Assignment operator
The arrow symbol <- is called the assignment operator:
x <- 5 # Assign 5 to xYou may also use the equal sign = for assignment:
y = 6 # Assign 6 to yNo assignment, no change
Variable assignment is needed to store a new value or to replace a previous one.
When there is no assignment, the result is printed to screen and lost.
For example, the sequence of statements below is the wrong way to “increase x by one”.
The result is calculated but it is not stored.
Type these commands line by line. Observe when value of x changes in the Environment pane:
x <- 5 # Assign 5 to x
x # Prints x (the current value is 5)[1] 5
x + 1 # Calculate x+1, prints 6[1] 6
x # Prints x (the value is still 5; no assignment above)[1] 5
Here is the correct way to “increase x by one”.
Type these commands line by line. Observe when value of x changes in the Environment pane:
x <- 5 # Assign 5 to x
x # Prints x (the current value is 5)[1] 5
x <- x + 1 # Calculate x+1 and assign the result to x
x # Prints x (the value is 6)[1] 6
Remember: no assignment, no change.
Where variables are stored
The variables are stored in R memory and RStudio shows them in the Environment pane.
When you close RStudio, all variables are lost.
We will learn how to write scripts to store commands used to calculate variable values. The scripts provide reproducibility (they document how the results were calculated).
Note: If you notice that some variables are present in the Environment pane just after you have started RStudio, then it means that you load state of the previous session. Follow the instructions in the Calculator lecture to disable this behavior in Global Options.
Valid variable names
Variable names in R are case sensitive (all other symbols also).
Note that _ is allowed in variable names.
Numbers (not at the beginning) are allowed too.
Here are some examples (and each one is a different variable):
gardenLength <- 6
garden_length <- 6
gArDeNlEnGtH <- 6 # valid, but bad, don't use this
garden.length <- 6 # valid, but may lead to mistakes, don't use this
len1 <- 5 # len1 is good, 1len is wrong
len2 <- 6Choose meaningful variable names for readable scripts.
Naturally, there are some limitations. Do not use variable names:
- containing a space (possible, but not recommended)
- containing any of
@#!%$^()-+=!~?,<>}{][ -
for,while,repeat,if,else,TRUE,FALSE,Inf,NA,NaN, … (R reserved names)
Autocomplete names
RStudio provides autocomplete feature, useful for long variable names.
Start typing the first letters of a (variable) name and press Tab.
Vectors (lecture) 👨🏫
Needs: “console”, “arithmetic”, round, pi.
Gives: “Vectors”, c, length, sum, mean, median, “:”, seq, TRUE, FALSE, NA, “RelOps”, is.na, class, str, “[…]”, names.
A vector is a container of (multiple) elements:
- All elements must be of the same type (e.g. all are numbers).
- The elements are kept at numbered positions (there is the first element, second, …).
- The elements might be given names (first may be named “Jay”, second “Gloria”, …).
Some examples of a vector (and its type):
- Heights of students (a vector of numbers,
numerical). - Names of students (a vector of texts,
character). - Whether students have siblings (a vector of FALSE/TRUE values,
logical). - Eye color of students (a vector of values from a limited choice list,
factor).
Vector is the primary data structure of the R language.
Note: A table (data.frame or tible discussed later) can be seen as a collection of vectors (columns).
Numerical vector
A vector of numbers you can create with the combine function c. Arithmetic operations and many functions work on vectors elementwise.
radiuses <- c(3.5, 6, 7, 1+1)
radiuses # print the vector[1] 3.5 6.0 7.0 2.0
radiuses + 2 # add 2 to each element[1] 5.5 8.0 9.0 4.0
Vector elements can have names, and the names are carried over in operations:
radiuses <- c(typical=3.5, large=6, huge=7, tiny=1+1)
radiusestypical large huge tiny
3.5 6.0 7.0 2.0
areas <- pi*radiuses^2
areas typical large huge tiny
38.48451 113.09734 153.93804 12.56637
round( areas, 1 ) # round to one decimal placetypical large huge tiny
38.5 113.1 153.9 12.6
Some functions summarize a vector to a single number:
Integer sequence
A simple regular sequence you can create with : (colon operator). Try:
positions <- 5:15
positions [1] 5 6 7 8 9 10 11 12 13 14 15
You can also use the function seq, which allows to specify the step:
spacedPositions <- seq(5,15,3)
spacedPositions[1] 5 8 11 14
Character vector
A character vector, so a vector of any texts, you may also create with the combine function c. Both single quotes ('word') and double quotes ("word") are allowed (must be identical at the beginning and the end of a text):
actors <- c( "Jay", "Gloria", "Claire", "Phil", 'Mitchell', 'Cameron' )
actors[1] "Jay" "Gloria" "Claire" "Phil" "Mitchell" "Cameron"
For example, a character vector is used to store names of vector elements. Use names function to access them:
Logical vector
In R the following words denote logical values: FALSE and TRUE. You may also create a logical vector with the combine function c, but typically, logical vectors are results of logical conditions:
temperatures <- c( 36.4, 35.8, 37.2, 38.9, 40.2, 37.1, 35.8, 34.8 )
temperatures > 36.6 # the result is a logical vector[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
Such a logical vector can be used to filter elements of another vector:
temperatures[ temperatures > 36.6 ] # greater than 36.6[1] 37.2 38.9 40.2 37.1
temperatures[ temperatures <= 36.6 ] # less or equal to 36.6[1] 36.4 35.8 35.8 34.8
temperatures[ temperatures != 35.8 ] # not equal to 35.8[1] 36.4 37.2 38.9 40.2 37.1 34.8
sel <- temperatures == 35.8
temperatures[ sel ] # exactly equal to 35.8[1] 35.8 35.8
temperatures[ !sel ] # NOT (exactly equal to 35.8), same as !=[1] 36.4 37.2 38.9 40.2 37.1 34.8
sel <- temperatures > 36 & temperatures < 38
temperatures[ sel ] # larger than 36 AND less than 38[1] 36.4 37.2 37.1
sel <- temperatures <= 35 | temperatures >= 40
sel # less or equal to 35 OR greater or equal to 40[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
The function sum might be used on a logical vector to count how many elements are TRUE:
sum( temperatures > 36.6 )[1] 4
Missing values
Use NA (not available) to represent a single missing value. Missing values can appear in vectors of any type. When a missing value appears in a calculation, the result is usually also a missing value:
Descriptive statistics functions like mean, min, max, sum require the argument na.rm=TRUE to ignore missing values:
The function is.na returns a logical vector with TRUE at positions corresponding to NAs:
When combined with the function sum, it counts the number of missing values in a vector:
Or, when negated (with ! symbol) it counts the number of not missing values in a vector:
To select all non-missing elements do:
radiuses[ !is.na( radiuses ) ][1] 3.5 6.0 7.0 2.0
Vector elements
Use square brackets to select an element from a vector. The first element is at position 1. The following selects an element from the second position:
temperatures <- c( 36.4, 35.8, 37.2, 36.9, 36.1 )
temperatures[2][1] 35.8
The last element can be selected as follows:
temperatures[ length( temperatures ) ][1] 36.1
To select multiple elements at once:
temperatures[ c(3,1) ] # third and first[1] 37.2 36.4
positions <- 2:4
temperatures[positions] # second, third, and fourth[1] 35.8 37.2 36.9
Note, when a vector is printed, the number reported on the left in square brackets shows the position of the next vector element:
101:128 [1] 101 102 103 104 105 106 107 108 109 110 111 112 # 101 is at [1st] pos
[13] 113 114 115 116 117 118 119 120 121 122 123 124 # 113 is at [113th] pos
[25] 125 126 127 128 # 125 is at [125th] pos
When a vector is named, the names can also be used in selection:
Structure and class
Use class to find what type of a vector is stored in a variable.
More information about the structure of a vector can be obtained with the function str.
class( temperatures )[1] "numeric"
class( positions )[1] "integer"
class( actors )[1] "character"
class( isFever )[1] "logical"
str( temperatures ) num [1:8] 36.4 35.8 37.2 38.9 40.2 37.1 35.8 34.8
str( positions ) int [1:11] 5 6 7 8 9 10 11 12 13 14 ...
str( actors ) chr [1:6] "Jay" "Gloria" "Claire" "Phil" "Mitchell" "Cameron"
str( isFever ) logi [1:8] FALSE FALSE TRUE TRUE TRUE TRUE ...
Estimate weights (exercise) 📖
Needs: “Vectors”.
Gives: sort.
Practices: c, round, sort.
Let’s assume that we want to estimate the weights of certain individuals. All we have is the following picture:
Let’s also assume that the individuals in the picture have healthy BMI and that we can roughly estimate their weights based on their heights. The heights in the picture are expressed in imperial units, e.g. 5’0’’ stands for 5 feet and 0 inches, where foot=12’’ and 1’’=2.54cm.
- Create a vector of heights of the individuals in metre based on your rough estimate of inches in the picture.
- Calculate the weights from the obtained heights by the equation \(Weight_{kg}=(Height_m-1)*100\).
- Calculate the mean of obtained weights rounded to 1 decimal.
- Include two additional weights 99 and 68 kg into your data.
- Sort the weights in decreasing order.
Solution…
heights_cm <- c(5*12+7,6*12+2,6*12+6,6*12+2,6*12)*2.54/100 # rough estimate in inches, conversion to meter
weights <- (heights_cm-1)*100 # weight estimate
round( mean( weights ), 1 ) # mean weight rounded to 1 decimal
weights <- c(weights, 68, 99) # adding additional weights
sort(weights, decreasing = TRUE) # sortingMonthly spendings (exercise) 📖
Needs: “Vectors”.
Practices: c, names, “[…]”.
Below is the monthly spending of different categories in the first quarter:
What is the total spending per category?
Solution…
nm <- names( spendJan )
spendJan[ nm ] + spendFeb[ nm ] + spendMar[ nm ]Vectors, continuation (lecture) 👩🏫
Needs: “Vectors”, “:”, length, TRUE, FALSE, “RelOps”.
Gives: “strsplitExample”, “[…]”, sort, unique, “==”, “!=”, any, all, which, duplicated, “[logical]”, paste, paste0.
Splitting a text
Splitting a single text at a certain character (here :) into a vector of words can be achieved as follows.
(Note: this is a more advanced example, the strsplit, [[...]] and lists will be introduced later.)
# Here is a single text with many words separated by :
text <- "apple:banana:lemon:apple:raspberry:pear:apple:lemon"
# More advanced code, which separated text at : into a vector of words
fruits <- strsplit( text, ":" )[[1]]
# Now, the fruits vector contains the words (a character vector)
fruits [1] "apple" "banana" "lemon" "apple" "raspberry" "pear"
[7] "apple" "lemon"
length( fruits )[1] 8
Fruits from the third to the fifth position are:
fruits[ 3:5 ][1] "lemon" "apple" "raspberry"
Sorting
And here is the sorted list of fruits:
sort( fruits )[1] "apple" "apple" "apple" "banana" "lemon" "lemon"
[7] "pear" "raspberry"
sort( fruits, decreasing = TRUE )[1] "raspberry" "pear" "lemon" "lemon" "banana" "apple"
[7] "apple" "apple"
The sort function can also be used to sort numbers:
Unique/duplicated elements
Here are the unique fruits:
unique( fruits )[1] "apple" "banana" "lemon" "raspberry" "pear"
This logical vector tells which elements has appeared before:
duplicated( fruits )[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
These elements are at the following positions:
which( duplicated( fruits ) )[1] 4 7 8
Similarly for numbers:
numbers [1] 3 1 4 1 5 9 2 6 5 3 5
[1] 1 2 3 4 5 6 9
duplicated( numbers ) # has this element appeared before: TRUE/FALSE [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
which( duplicated( numbers ) ) # at which positions are duplicates[1] 4 9 10 11
All and any
The following logical vector shows which fruits are apple (or not apple):
fruits[1] "apple" "banana" "lemon" "apple" "raspberry" "pear"
[7] "apple" "lemon"
fruits == "apple" # is this element "apple": TRUE/FALSE[1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
fruits != "apple" # is this element not "apple": TRUE/FALSE[1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
Are there any apple elements in the vector:
any( fruits == "apple" ) # is there any element "apple": a single TRUE or FALSE[1] TRUE
Or, are all fruits not apple:
all( fruits != "apple" ) # are all elements not "apple": a single TRUE or FALSE[1] FALSE
Some examples with numbers:
numbers [1] 3 1 4 1 5 9 2 6 5 3 5
numbers >= 5 & numbers < 7 # is this element greater/equal 5 and less than 7: TRUE/FALSE [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
any( numbers >= 5 & numbers < 7 ) # is there any element greater/equal 5 and less than 7, a single TRUE or FALSE[1] TRUE
Selecting elements
What are the other fruits than apples (duplicates possible):
fruits[ fruits != "apple" ][1] "banana" "lemon" "raspberry" "pear" "lemon"
fruits[ !(fruits == "apple") ][1] "banana" "lemon" "raspberry" "pear" "lemon"
At which positions is the lemon:
which( fruits == "lemon" ) # not a logical vector, but integer positions[1] 3 8
What are the numbers which are smaller than 5 or greater/equal 7:
numbers[ numbers < 5 | numbers >= 7 ][1] 3 1 4 1 9 2 3
Producing text by merging elements
For reporting there is often a need to merge elements into a single text:
uqFruits <- unique( fruits ) # character vector, several elements
uqFruitsTxt <- paste( uqFruits, collapse = ", " ) # single text: fruit names separated by ,
txt <- paste0("Unique fruits: ", uqFruitsTxt, ".") # single text: a full sentence
txt[1] "Unique fruits: apple, banana, lemon, raspberry, pear."
Another example with numbers:
Descriptive statistics (exercise) 📖
Needs: “Vectors”, sum, min, max, median, fivenum, head, sort, tail, NA, “RelOps”.
Practices: length, sum, max, min, median, fivenum, head, tail, sort, sum, NA, “RelOps”.
You are given a numerical vector heights representing the heights(cm) of a group of individuals. For some reasons some of these individuals have missing value (NA) for their heights.
heights <- c(180, 170, NA, NA, 174, 178, 169, 174, NA, 174, 170, 174, 173,
168, 170, 166, NA, 187, 170, 164, NA, 167, 178, 165, 164)- How many individuals are in the group?
- How many individuals have their height not reported?
- What is the height range (minimum and maximum)?
- If the individuals are ordered by their height in a row, what is the height of the individual standing in the middle of the row? If no middle exists in case of odd number of reported heights then take the average of the two in the middle. Which function in R does exactly this?
- Try
fivenum(see ?fivenum) onheightsvector and inspect the results. - Extract and print 5 known lowest and 5 known largest heights.
- How many individuals are within the range (175,180), excluding both 175 and 180?
- How many individuals are at most 175 cm or at least 180 cm tall?
- Create a variable
validHeightswith only the valid heights (removeNAvalues).
Solution…
# How many individuals
length(heights)
# How many individuals have their height not reported
sum(is.na(heights))
# Height range
max(heights, na.rm=TRUE)
min(heights, na.rm = TRUE)
range( heights, na.rm=TRUE)
# Middle height
median(heights, na.rm=TRUE)
# Fivenum gives min/median/max and two quartiles (see ?fivenum)
fivenum(heights)
# 5 lowest and 5 largest heights
head(sort(heights), 5)
tail(sort(heights), 5)
# Number of individuals within the range (175,180), uses & (logical AND)
sum(heights > 175 & heights < 180, na.rm=TRUE)
# Number of individuals at most 175 cm or at least 180 cm tall, uses | (logical OR)
sum(heights <= 175 | heights >= 180, na.rm=TRUE)
# Valid heights, uses negation of is.na result
validHeights <- heights[!is.na(heights)]Snacks in a kindergarden (exercise) 📖
Needs: “strsplitExample”, length, “[…]”, “:”, sort, unique, “==”, “!=”, any, all, which, duplicated.
Practices: length, “==”, any, unique, duplicated, sort, which.
In a kindergarten there are children, each of a different name. The children are eating snacks. Each child should get exactly one snack. Here, in nms, there are the names of the children in the order of picking snacks (every time a single snack was picked):
nmsTxt <- "Oliver Mia Walter Charlie Liam Kate Hank Gina Zoe Eva Xena Tom Nora Jack Cathy Helen Yuri Rita Ivan Ben Adam Daisy Liam Quinn George Sam Vera Pam Eve Bob Alice Frank David Fred Yuri Mia Yuri Ursula"
nms <- strsplit( nmsTxt, " " )[[1]] # split the text into words, will be introduced later- How many snacks were picked?
- How many kids picked a snack?
- Did
Janeget a snack? - Are there children who picked a snack more than once?
- What are the names of the children who picked a snack more than once (but don’t mention any child twice)?
- Can you produce an alphabetical list of children names, skipping duplicates?
- Assuming that
Evapicked exactly one snack: how many snacks were picked beforeEvadid, and how many after? -
Fredalso took just one snack: what are the unique names of the children who picked a snack afterFred?
Solution…
# num of snacks taken
length(nms)
# num of kids
length(unique(nms))
# Jane got a snack?
any(nms == "Jane")
# more snacks picked than kids?
length(unique(nms)) < length(nms)
# which kids took more than one snack?
unique(nms[duplicated(nms)])
# alphabetical list of kids
sort(unique(nms))
# how many kids before "Eva"?
which(nms == "Eva")-1
# how many kids after "Eva"?
length(nms) - which(nms == "Eva")
# kids after "Fred"
unique( nms[(which(nms == "Fred")+1):length(nms)] )Sampling from the normal distribution (lecture) 👨🏫
Needs: “callingFunctions”, “console”, head, min, max.
Gives: rnorm, hist, quantile, median.
The function rnorm can be used to sample data from the normal distribution:
[1] 0.02655213 -1.62561586 0.25029381 -0.09848593 0.08729105 0.02644690
[7] 1.53498811 0.71997270 1.20024275 -0.64711427
It is possible to specify the mean and standard deviation of the distribution:
[1] 11.469629 9.112028 12.313071 9.841733 10.309058 9.981308 11.076697
[8] 10.848701 11.557499 9.852424
A simple histogram can be created with hist to visualize the distribution of the sampled data:
hist( vs )The hist function has many arguments to customize the plot. For example (check ?hist for more details):
hist( vs, breaks=20, col="lightblue", main="Sampled with rnorm(500,10,2)" )The quantile function can be used to calculate quantiles of the data (using many different methods). For example, to calculate the 0, 0.1, 0.25, 0.5, 0.75, 0.9 and 1 quantiles:
0% 10% 25% 50% 75% 90% 100%
3.949082 7.752900 8.790042 10.109324 11.331642 12.485473 17.302834
Compare the calculated quantiles with:
Factors (lecture) 👩🏫
Needs: “Vectors”, NA, length, str, c, “[…]”, NA.
Gives: “Factors”, levels, nlevels, as.integer, table, names.
A factor is a vector of categorical values (values chosen from a set of predefined possibilities, called levels).
For example, a factor can be used to store hair color of some people. Let’s assume that we can classify each person to one of the following hair colors. This will be the levels of the factor:
c( "black", "brown", "ginger", "blond", "gray", "white" )Now, let’s assume that for 10 people we observed the following hair colors.NA (not available, missing value) stands for a person whose hair color could not be determined.
This will be the values of the factor:
c("brown","ginger","black","brown",NA,"brown","black","brown","blond","gray") To create the factor and store it in a variable hairColors, do:
hairColors <- factor(
c("brown","ginger","black","brown",NA,"brown","black","brown","blond","gray"),
levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)
hairColors [1] brown ginger black brown <NA> brown black brown blond gray
Levels: black brown ginger blond gray white
Factor properties and elements
A short presentation of a factor can be obtained with the str function:
str( hairColors ) Factor w/ 6 levels "black","brown",..: 2 3 1 2 NA 2 1 2 4 5
The length of the factor is the number of observations:
length( hairColors )[1] 10
And elements of the factor can be accessed by the square bracket operator:
hairColors[ 3:6 ][1] black brown <NA> brown
Levels: black brown ginger blond gray white
Elements of a factor can also have names. For example:
names( hairColors ) <- c( "Alice", "Bob", "Charlie", "Daisy", "Eve",
"Frank", "George", "Helen", "Ivan", "Jack" )
hairColors Alice Bob Charlie Daisy Eve Frank George Helen Ivan Jack
brown ginger black brown <NA> brown black brown blond gray
Levels: black brown ginger blond gray white
Naturally, the names can be used to access the elements:
hairColors[ "Alice" ] <- "ginger" # this changes the hair color of Alice
hairColors Alice Bob Charlie Daisy Eve Frank George Helen Ivan Jack
ginger ginger black brown <NA> brown black brown blond gray
Levels: black brown ginger blond gray white
Levels and coding
In R each factor is stored as two vectors.
The first vector codes the levels with subsequent integers: 1=black, 2=brown, …
Type the following to get this vector:
levels( hairColors )[1] "black" "brown" "ginger" "blond" "gray" "white"
and to find the number of factor levels enter:
nlevels( hairColors )[1] 6
The second vector keeps integers corresponding to the factor values.
Try the following and understand the relation of the integers and the levels:
as.integer( hairColors ) [1] 3 3 1 2 NA 2 1 2 4 5
Count occurrences
Use the table function to count occurrences of each level of a factor:
A factor from raw data, data cleaning (exercise) 📖
Needs: “Vectors”, table, NA.
Gives: “Factors”, factor, addNA.
Practices: factor, addNA, c, NA, table.
Raw data
Let’s assume, that an expensive survey was conducted, and the participants were asked to fill in one of the following values: low, medium, high. Some participants made typos. The data needs to be cleaned, trying to preserve as much information as possible.
The following chunk of code produces a character vector vs with the values obtained from the survey.
vsTxt <- "low low low low medium L low medium medium low low low low low ? low looow low low high low low mediun medium low right low low low medium HIGH low low low low low low low Medium low low low low low high medium low low low medium low low low high low low medium low high medium medium low medium high L ?"
vs <- strsplit( vsTxt, " " )[[1]] # split the text into words, will be introduced later
vs[10] <- NA
head( vs, 16 ) # the first 16 responses [1] "low" "low" "low" "low" "medium" "L" "low" "medium"
[9] "medium" NA "low" "low" "low" "low" "?" "low"
Building a factor
The goal is to create a factor fs from the corrected values of the vs vector:
- Print, how many raw responses are there.
- Create the factor with the levels specified in the
"low","medium","high"order, and then add theNAlevel with theaddNAfunction. - Finally, print the counts of each of the levels in the factor.
Modifications of the raw input are not allowed (don’t correct the words in vsTxt manually). Instead, create a named character vector m with names corresponding to the raw spelling, and values corresponding to the correct levels, e.g. m <- c( "HIGH"="high", ... ). Then, map the vs vector through m using the square bracket operator. When you can’t decide what should be the level, use NA.
Note: For a tidyverse solution, you could use fct_count to count the levels, fct_collapse to map the words to correct levels, fct_relevel to reorder the levels.
Solution…
### base-R variant
length( vs )
table( vs )
m <- c( "?"=NA, "high"="high", "HIGH"="high", "medium"="medium", "mediun"="medium", "Medium"="medium", "looow"="low", "low"="low", "L"="low" )
table( m[ vs ] )
fs <- factor( m[ vs ], levels = c( "low", "medium", "high" ) )
fs <- addNA( fs )
table( fs )
### tidyverse variant
fs <- factor( vs )
fct_count( fs )
fs <- fct_collapse( fs, low = c( "L", "low", "looow" ), medium = c( "medium", "Medium", "mediun" ), high = c( "HIGH", "high" ), other_level = NA )
fct_count( fs )
fs <- fct_relevel( fs, "low", "medium", "high" )
fct_count( fs )Generate sample of female heights (exercise) 📖
Needs: “Vectors”, “RelOps”, round, rnorm, names, mean, sd, hist, sort, quantile.
Gives: sample.
Practices: rnorm, round, sample, names, mean, sd, hist, sort, head, quantile.
Let’s assume that you are expected to work with data describing heights of female students.
The real data is not available yet, so you need to generate a random sample.
Generate
Generate a named vector heights:
- There should be
studentsNumelements in the vector. - Values should be sampled from the normal distribution with a mean
femaleMeanHeightand a standard deviationfemaleSdHeight. - Names should be randomly sampled from the provided
femaleNamesvector. Find out how to use thesamplefunction to perform this task.
studentsNum <- 30
femaleMeanHeight <- 170.7 # cm
femaleSdHeight <- 6.3 # cm
femaleNamesTxt <- "Anastasia,Melissa,Nora,Marlene,Isabel,Eleonora,Amelie,Katharina,Leonie,Sophie,Zoe,Vera,Julia,Magdalena,Elena,Karina,Serena,Viviana,Carmen,Ella,Anita,Ines,Beatrice,Susanne,Katja,Yvonne,Noemi,Miranda,Sonja,Clara,Valentina,Luisa,Victoria,Stefanie,Elisabeth,Martina,Sophia,Sara,Lea,Lorena,Patricia,Emma,Tanja,Nicole,Silvana,Martha,Selina,Melanie,Jolanda,Lucia,Isabella,Felicia,Giulia,Marina,Samantha,Jasmin,Lucie,Corina,Luna,Matilda,Jessica,Lena,Francesca,Alessia,Claudia,Lavinia,Nadia,Regina,Carolina,Adriana,Natalie,Nina,Sylvia,Livia,Miriam,Maria,Mia,Diana,Fiona,Patrizia,Helena,Judith,Laura,Gabriela,Liliana,Tara,Selena,Mona,Paula,Valeria,Verena,Tatjana,Ariana,Alina,Anna,Rebecca,Teresa,Marisa,Gloria,Luciana"
femaleNames <- strsplit( femaleNamesTxt, "," )[[1]]Study
Study the generated heights vector:
- Find the mean and standard deviation of the
heights. Are these values far from the parameters used to generate theheights? - Produce the histogram of the
heights. - Find the names of students with heights above the sample mean.
- Print the heights and the names of the 5 tallest students (sorted with decreasing heights).
- What are the heights corresponding to the quantiles 0.3 and 0.7?
Data manipulation 1/3
The pipe operator (lecture) 👨🏫
Needs: rnorm, mean, round, sum.
Gives: “|>”, “%>%”.
We want to calculate the mean of 100 randomly generated values from the normal distribution and round it to two decimal places. A step by step solution in statements would look like this:
where two variables vs and vsMean were introduced. A more compact solution would be to call the functions in composition without introducing variables such vs and vsMean:
R version 4.1.0 introduced the pipe operator |> (earlier %>% in magrittr R package part of tidyverse). It allows to write the code in a “chained” manner. The pipe operator passes the result of the left-hand side expression as the first argument of the right-hand side function. The expression sum(1:10) is equivalent to 1:10 |> sum().
The pipe notation represents the “data flow” in the expression towards the end result, it also corresponds well to the order of the solution to the problem:
The tidyverse library is carefully designed to work well with the pipe operator.Complex data manipulation tasks can be written in a compact and readable way.
Tibble (lecture) 👨🏫
Needs: “Vectors”.
Gives: tibble, “[[”, “[”.
In previous session we imported existing tables as tibble, a data type defined in tidyverse package. To create a tibble manually in R we use the function tibble:
d <- tibble(
name = c("Gabriela", "Gale", "John", "Guy", "George"),
sex = c("female", "female", "male", "male", "male"),
age = c(17, 17, 20, 18, 17))
d# A tibble: 5 × 3
name sex age
<chr> <chr> <dbl>
1 Gabriela female 17
2 Gale female 17
3 John male 20
4 Guy male 18
5 George male 17
You can extract variables(columns) as vectors with [[ and $, or as a separate tibble with [.
d[["name"]] # <=> d$name : extracts 'name' as a vector
d["name"] # extracts 'name' as tibble
d[c("name", "age")] Similar to lists, variables in a tibble may be updated with an assignment <-:
With the pull function (tidyverse) we can extract a (named) vector. Recall that we can extract vectors from a tibble with [[ or $, however the advantage of the pull function is that it can be used in a pipe expression.
The pipe operator (exercise) 📖
Needs: “|>”, “%>%”, sort, tail, mean, round, rnorm, paste.
Practices: “|>”, “%>%”.
Solve the following two tasks using the non-pipe and the pipe versions of the code:
A vector with first names of some people is given, e.g.
c( "Tom", "Jerry", "Mickey", "Donald" ). Sort the names in the reversed alphabetical order and concatenate them into a single string separated by commas (hint: seepastefor character concatenation).A vector of person heights is given, e.g.
hs <- rnorm( n=100, mean=170, sd=10 ).
Calculate the mean height of the 5 tallest persons. Round the result to one decimal place.
Tibble (exercise) 📖
Needs: tibble, mean, “RelOps”, “|>”.
Practices: tibble, mean, “RelOps”, “|>”.
Create a tibble
exercise_groupfor a group of individuals with names {Sonja, Steven, Ines, Robert, Tim} with theirheight{164, 188, 164, 180, 170},weight{56.0, 87.0, 54.0, 80.0, 58.5} and frequency ofexercise{high, high, low, moderate, low}.Add a new numeric variable
sex(1=‘male’, 2=‘female’) toexercise_grouptibble.Calculate the mean height for males and females.
Add a new candidate Eve to
exercise_groupwith height=170, weight unknown and exercise levelhigh.Calculate the mean weight. Use the pipe operator and thereby avoid creating intermediate variables.
Create a named vector of weights of the individuals with high exercise level.
Solution…
# 1)
exercise_group <- tibble(name=c("Sonja" , "Steven", "Ines", "Robert", "Tim" ),
height=c(167, 185, 164, 180, 172),
weight=c(56.0, 87.0, 54.0, 80.0, 58.5),
exercise=c("high", "high", "low", "moderate", "low")
)
# 2)
exercise_group[['sex']] <- c(2,1,2,1,1)
# 3)
#
# Base R solution is used here, try this with 'filter' which will
# be explained later.
#
heights <- exercise_group[["height"]]
females <- exercise_group[["sex"]]==2
males <- exercise_group[["sex"]]==1
mean(heights[males])
mean(heights[females])
# 4)
exercise_group <- tibble(name = c( exercise_group$name, "Eve"),
height = c( exercise_group$height, 170),
weight = c( exercise_group$weight, NA),
exercise = c( exercise_group$exercise, 'high'),
sex = c( exercise_group$sex, 2)
)
# 5)
exercise_group |> pull(weight) |> mean(na.rm=TRUE)
# 6)
weight_levels <- exercise_group |> pull(weight, name)
weight_levels[ (exercise_group |> pull(exercise, name)) == "high"]Select: choosing columns (lecture) 👩🏫
Files: pulseNA.csv.
Needs: “pulseNA.csv”, read_csv, head, tail, “tidyverse”, “|>”, “%>%”.
Gives: select, rename.
The select function (from tidyverse) is used to select variables(columns) of a table. Given a table the select can be used to:
- take a collection of variables and simultaneously reorder and/or rename the variables
- reshuffle the variables order
You’ll need to load tidyverse library to be able to use select, also import the dataset pulseNA.csv into the variable d:
Execute the following code and observe the results (note, the d table is not modified as there are no assignments to d):
d # all cols, all rows
d |> head() # top rows (default 6)
d |> select( weight, height ) # 2 columns, all rows
d |> select( gender, weight, height ) |> head(n=3) # 3 columns, 3 top rows
d |>
select( gender, weight, height, everything() ) |>
head(n=3) # top 3 rows of reordered columns
# Note: |> multi-line statement
d |> select( -id, -name, -year ) |> head() # no id, name, year columns
d |> tail() # last rows (default 6)
d |> select( WEIGHT=weight, height ) |> tail() # 2 columns (one renamed)
d |> rename( WEIGHT=weight ) |> tail() # all columns, but one renamedNote, the lines above do not modify the d table. If you want to store a modified table, you need to assign the result to a variable:
Filter: choosing rows (lecture) 👨🏫
Files: pulseNA.csv.
Needs: “pulseNA.csv”, “|>”, head, tail, “tidyverse”, is.na.
Gives: filter, “%in%”, “RelOps”.
The filter function is used to select rows of a table that meet certain conditions.
Manually run the pieces of code given below. Compare the output of the following lines of code to each other. Understand how you filter rows based on different conditions. Observe, how to build longer pipes by combining filtering with sorting (or with calculating the number of rows).
Load the pulseNA.csv file into the d variable. The filter function comes from the tidyverse library, so you need to load it with library(tidyverse).
Multiple conditions Logical conditions may be combined using and (‘&’) and or (‘|’) operators where:
-
A & B: both conditionsAandBmust beTRUE -
A | B: at least one ofAorBmust beTRUE
d |> filter( weight < 50 | weight > 100 ) # OR
d |> filter( weight >= 80 & weight < 90 ) # 80 yes, 90 no (AND)
d |> filter( between( weight, 80, 90 ) ) # 80 yes, 90 yes
d |> filter( weight >= 80, weight < 90 ) # 80 yes, 90 no (comma and AND)
d |> filter( weight >= 80 ) |> filter( weight < 90 ) # 80 yes, 90 no (two steps are AND)Negation operator (‘!’) takes the complement of the logical expression, i.e. if the outcome of logical expression A is TRUE then !A is FALSE and vice versa.
‘%in%’ operator Tests whether the elements of a vector are present in another and returns a logical vector of the size of the left operand.
d |> filter( gender == "male" )
d |> filter( !( gender == "male" ) )
d |> filter( gender != "male" ) # equivalent to the previous line
d |> filter( weight < 50 | weight > 100 ) |> filter( gender != "male" )
d |> filter( gender == "male" ) |> nrow()
d |> filter( exercise == "low" )
d |> filter( exercise == "low" | exercise == "moderate" )
d |> filter( exercise %in% c( "low", "moderate" ) )
selExercises <- c( "low", "moderate" )
d |> filter( exercise %in% selExercises )
d |> filter( !is.na( exercise ) ) # remove rows where exercise is missing
d |> filter( !is.na( exercise ) | !is.na( pulse2 ) ) # remove rows where
# exercise or pulse2 is missingArrange: sorting rows (lecture) 👩🏫
Files: pulseNA.csv.
Needs: “pulseNA.csv”, 🔴“LcPipeOperator”, 🔴“LcSelect”, head, tail, 🔴tidyverse.
Gives: arrange.
The arrange function (tidyverse) is used to sort rows of a table based on the values in one or more columns.
As it was the case for select and filter you’ll need to load tidyverse library to be able to use arrange, also import the dataset pulseNA.csv:
Run the code below. Compare the output of the following lines of code to each other. Understand how you sort rows of the table and print table parts.
Also compare the output between each of these longer table manipulation “pipe” expressions:
The tidyverse library is designed to make the code more readable and easier to understand. For example, the last expression given above is equivalent to the following base-R code:
Filter (exercise) 📖
Files: pulseNA.csv.
Needs: “pulseNA.csv”, “|>”, “%>%”, head, tail, “tidyverse”, filter, “%in%”, “RelOps”.
Practices: filter, “%in%”, “RelOps”.
Import pulseNA.csv into variable d:
Filter the rows of the
dtable to select only the rows where theageis one of:18or21. Propose two ways to do this.Filter the rows of the
dtable withweightmore than60but not more than70.Filter the rows of the
dtable to select only the rows where there is missing data on exercise and the participant was running.Filter the rows of the
dtable to select only the rows where theexerciseis not missing and the participant is drinking alcohol.
Solution…
# 1)
d |> filter( age == 18 | age == 21 )
d |> filter( age %in% c( 18, 21 ) )
# pre-selected set of ages in 'selAges'
selAges <- c( 18, 21 )
d |> filter( age %in% selAges )
# 2)
d |> filter( weight > 60 & weight <= 70 )
# 3)
d |> filter( is.na( exercise ), ran == TRUE )
# 4)
d |> filter( !is.na( exercise ), alcohol == "yes" )Mutate: modifying the table (lecture) 👩🏫
Files: pulseNA.csv.
Needs: 🔴“surveyNA.csv”, 🔴“LcPipeOperator”, 🔴“LcSelect”, 🔴tidyverse.
Gives: if_else, mutate.
The mutate function is used to add or modify (new) variables(columns) in the table. It can use the values of existing variables to calculate the values of new ones. This operation does not change the number of rows in the table.
pulse <- read_csv( "pulseNA.csv" , show_col_types = FALSE)
pulse |>
select( height ) |>
mutate( isVeryTall = height > 190 ) |>
head()
pulse |>
select( height ) |>
mutate( howTall = if_else( height > 190, "very", "not_very" ) ) |>
head()
pulse |>
select( pulse1, pulse2 ) |>
mutate( pulseDiff = pulse2 - pulse1 ) |>
head()
pulse |>
select( pulse1, pulse2 ) |>
mutate( minPulse = pmin(pulse1, pulse2), maxPulse = pmax(pulse1, pulse2) ) |>
head()
pulse |>
select( pulse1, pulse2 ) |>
mutate(
minPulse = if_else( pulse1 < pulse2, pulse1, pulse2 ),
maxPulse = if_else( pulse1 < pulse2, pulse2, pulse1 ) ) |>
head()Mutate: modifying the table (exercise) 📖
Files: pulseNA.csv, survey.csv.
Needs: 🔴“LcMutate”, “pulseNA.csv”, 🔴“survey.csv”, if_else, 🔴“LcMutate”, 🔴tidyverse.
Gives: case_when, geom_line.
- Take the following vectors
weight(pound) andheight(inch) and store them as variables(columns) of a tibble. Add a column BMI to this tibble where \(BMI=weight_{(kg)}/height_{(m)}^2\), and round the BMI values to one decimal place. Note that \(1(in) = 0.0254(m)\) and \(1(lb) = 0.4535923(kg)\).
- The following table is one of many BMI classifications:
| classification | bmi | |
|---|---|---|
| 1 | underweight | <18.5 |
| 2 | normal | 18.5-24.9 |
| 3 | overweight | 25-29.9 |
| 4 | obese | >=30 |
Add the variable BMI_class to the pulse tibble based on the classification in the table above. Use the function case_when as a better alternative for if_else.
Solution…
# 1
d <- tibble(
weight = c( 113, 137, 153, 142, 133, 123, 131, 136 ), # unit: pound
height = c( 66, 72, 69, 68, 68, 69, 70, 70 ) # unit: inch
)
d |>
mutate(weight_kg=weight*0.4535923, height_m=height * 0.0254) |>
mutate(BMI=round(weight_kg/(height_m)^2, 1) ) |>
select(weight, height, BMI)
# 2, bad solution, think about BMI set to 24.95
pulse<- read_csv("pulseNA.csv", show_col_types = FALSE)
pulse |>
mutate(BMI=round( weight/((height/100)^2), digits = 1)) |>
mutate(BMI_class = case_when(
BMI < 18.5 ~ "underweight",
BMI >= 18.5 & BMI <= 24.9 ~ "normal",
BMI >= 25 & BMI <= 29.9 ~ "overweight",
BMI >= 30 ~ "obese"
)
)
# 2, better solution
# case_when checks conditions in sequence, the second one is checked only if the first is FALSE,
# therefore in the second condition it is not necessary to check that the first was FALSE.
pulse |>
mutate(BMI=round( weight/((height/100)^2), digits = 1)) |>
mutate(BMI_class = case_when(
BMI < 18.5 ~ "underweight",
BMI < 25 ~ "normal", # here we know already that BMI >= 18.5
BMI < 30 ~ "overweight", # here we know already that BMI >= 25
TRUE ~ "obese" # TRUE means: all other cases
)
)Select & Filter: Survey dataset (exercise) 📖
Files: survey.csv, survey.csv.gz.
Needs: “|>”, “%>%”, head, tail, “tidyverse”, filter, nrow, select, is.na, range, “|>”, “%>%”, “%in%”, “RelOps”.
Gives: “survey.csv”.
Practices: filter, nrow, select, is.na, range, “|>”, “%>%”, “%in%”, “RelOps”.
Survey dataset
The survey data set is available as comma-delimited (.csv) file: survey.csv (or additionally compressed as survey.csv.gz). This data frame contains the responses of 233 Statistics I students at the University of Adelaide to a number of questions. Here we use a slightly modified version of the survey data from the R MASS pacakge.
| Variable | Explanation |
|---|---|
| name | Name of a participant |
| gender | Sex (male/female) |
| span1 | Span (distance from tip of thumb to tip of little finger of spread hand) of writing hand (cm) |
| span2 | Span of non-writing hand (cm) |
| hand | Writing hand of student (left/right) |
| fold | Fold your arms! which is on top? (right/left/neither) |
| pulse | Pulse measurement (rate per minute) |
| clap | Clap your hands! which is on top? (right/left/neither) |
| exercise | Frequency of exercise (freq/some/none) |
| smokes | How much the student smokes (heavy/regul/occas/never) |
| height | Height (cm) |
| m.i | whether the student expressed height in imperial (feet/inches) or metric (centimetres/metres) units. (metric/imperial) |
| age | Age of the student (years) |
Queries
Select teenagers, assume
agerange between and including 10 and 19.Select all females with
pulseequal to 60.Select all male teenagers with
pulseabove 60.How many males do smoke and never exercise?
How many females never smoke and frequently exercise?
-
Produce the following tibbles:
6.1 Personal information {Name,Age,Gender,Height} of all teenagers.
6.2 Personal information of males with
heightbetween and inclusive 170 to 180. Has survey data missing
pulsevalues? If so how many? What aboutage?What is the percentage of males who never smoke and frequently exercise? Do the same for female.
What is the age range in teenagers? You may use the range function (?range).
How many males do smoke and never exercise? Use ‘%in%’ operator.
Add the variable
age_grouptosurveybased on the following age classification:
| classification | age | notation | |
|---|---|---|---|
| 1 | adult | >19 | (19,∞) |
| 2 | adolescent | >=10 and <=19 | [10,19] |
| 3 | child | >=1 and <=9 | [1,9] |
| 4 | infant | <1 | (-∞,1) |
Solution…
survey <- read_csv("survey.csv", show_col_types = FALSE)
# 1)
survey |> filter(10<=age & age<20)
# 2)
survey |> filter(pulse==60 & gender=="female")
# 3)
survey |> filter(pulse>60 & gender=="male" & (10<=age & age<20) )
# 4)
#
# The conditions are 'do smoke' and 'never exercise'.
# With 'do smoke' we have the categories {regul,occas,heavy} therefore 'or' (|)
# logical connective is in place. For 'never exercise' we have the category 'none'.
# Since both conditions must be true we will have the 'and' (&) logical connective.
survey |>
filter((smokes=="regul" | smokes=="occas" | smokes=="heavy") & exercise=="none" & gender=="male" ) |>
nrow()
# Alternatively, a shorter condition for 'smokes' is smokes!="never". It means accept
# all values for 'smokes' as long as it is not equal to "never" and those are
# exactly {regul,occas,heavy}.
survey |> filter(smokes!="never" & exercise=="none" & gender=="male" ) |> nrow()
# 5)
survey |> filter(smokes=="never" & exercise=="freq" & gender=="female") |> nrow()
# 6)
### 6,1
survey |> select(Name=name, Age=age, Gender=gender, Height=height) |>
filter(10<=Age & Age<20)
### 6.2
survey |> select(Name=name, Age=age, Gender=gender, Height=height) |>
filter(170<=Height & Height<=180 & Gender=="male")
# 7)
survey |> filter(is.na(pulse)) |> nrow()
survey |> filter(is.na(age)) |> nrow()
# 8) Extra exercise
# men
none_smoker_sportsmen <- survey |>
filter(survey, smokes=="never" & exercise=="freq" & gender=="male") |>
nrow()
total_men <- survey |> filter(gender=="male") |> nrow()
none_smoker_sportsmen / total_men
# women : replace the condition gender=="male" into gender=="female"
# 9)
survey |> filter(10<=age & age<20) |> pull(age) |> range()
# 10)
survey |> filter( (smokes %in% c("regul","occas", "heavy")) & exercise=="none" & gender=="male" ) |> nrow()
# 11)
mutate(survey, age_group = case_when(
age > 19 ~ "adult",
age >= 10 & age <= 19 ~ "adolescent",
age > 1 & age <= 9 ~ "child",
age <= 1 ~ "infant"
)
)Data manipulation 2/3
Data Read/Write (lecture) 👨🏫
Common Data File formats
-
.rds— R’s single-object data format [saveRDS(),readRDS()]. -
.rdaor.RData— R’s workspace data format, saving multiple R objects [save(),load()]. -
.csv— Comma-separated values, a common format for tabular data. [read_csv(),write_csv()] -
.tsvor.txt— Tab-separated or plain text files, often used for data files. [read_tsv(),write_tsv()] -
.xlsxor.xls— Excel spreadsheet files (supported withreadxloropenxlsxpackages). -
.json— JSON format, often used for web data (supported withjsonlitepackage). -
.xml— XML format, used for hierarchical data (supported withXMLorxml2packages). -
.sav,.sas7bdat,.dta— Proprietary formats from SPSS, SAS, and Stata (havenorforeignpackages). -
.yaml- Data serialisation language (yamlpackage) (yaml.org, pecification).
In R, data types vary in the level of meta-information they can hold, which refers to the ability to store additional descriptive data about the dataset beyond the raw values.
Vectors: The most basic structure, they hold atomic data (numeric, character, etc.) with minimal metadata like length and type.
Matrices and Arrays: Extend vectors by organizing data into two or more dimensions. They allow metadata like dimension names (
rownames,colnames,dimetc.) but are still limited in complexity.Data Frames: A more flexible format, storing heterogeneous data (different types in columns) with richer metadata like column names, class attributes for specific columns (e.g., factors), additional metadata about printing behavior.
Lists: Highly flexible, they can store any type of R object, including additional metadata and nested structures. Less flexible for tabular analysis.
Specialized Objects: Formats like
S3,S4, andR6objects can include extensive metadata via attributes, enabling complex data structures with custom methods.
The higher the level of abstraction (like tibbles or specialized objects), the more meta-information the format can hold.
WARNING: writing and reading between different data file formats may lead to loss of meta information.
Data Read/Write (exercise) 📖
Data Read/Write
-
Data format In the
survivalpackage use the small datasetamland export it to the following data formats givenFilenameandFunctions. Functions are available via thetidyverseandhavenpackage.
Filename Storage Format Functions Package
aml.csv ASCII Text CSV read_csv/write_csv tidyverse
aml.dta binary STATA read_dta/write_dta haven
aml.rds binary R read_rds/write_rds tidyverse (<=> readRDS/saveRDS)
aml.sas7bdat binary SAS read_xpt/write_xpt haven
aml.sav binary SPSS read_sav/write_sav haven
aml.tsv ASCII Text TSV read_tsv/write_tsv tidyverse- Prepare SPSS data in R The SPSS data format can:
- hold labels on each variable as a short description of the variable.
- label the values of a variable in case of categorical variables.
Labeling a variable (column) is achieved by assigning the attribute label to it with a value, with the value being the description of the column (see ?attr). The labels of a categorical variable are defined as a key/value pair (see ?haven::labelled).
Take the survival package table aml and create and SPSS equivalent object in R, where:
- variables have labels as described in the
amlhelp page (?survival::aml) - variable
aml$xhas values 1 and 2 for its levelsMaintainedandNonmaintainedrespectively
Finally export the created SPSS object to a file, say aml.sav and read it back into R and check whether the labels are maintained.
-
Excel sheets Read an excel sheet file of your own data using the function exported by
readxlR package, namelyread_xls()orread_xlsx(). You may need to installreadxlR package.
Solution…
# 1)
#
library(haven)
library(survival) # aml dataset
# export
aml |> write_sav("aml.sav") # SPSS
aml |> write_csv("aml.csv") # CSV
aml |> write_tsv("aml.tsv") # TSV
aml |> write_dta("aml.dta") # Stata
aml |> write_xpt("aml.sas7bdat") # SAS
aml |> write_rds("aml.rds") # Natvie R
# import
aml_sav <-read_sav("aml.sav")
aml_tsv <-read_tsv("aml.tsv", show_col_types = FALSE)
aml_csv <-read_csv("aml.csv", show_col_types = FALSE)
aml_dta <-read_dta("aml.dta")
aml_sas <-read_xpt("aml.sas7bdat")
aml_rds <-read_rds("aml.rds")
# The variables time and status are vectors whereas variable x is a factor with meta-information
# Inspect and compare the original aml$x class with all other imports.
#
# solution 1.1)
class(aml$x)
class(aml_sav$x)
class(aml_dta$x)
# etc.
# solution 1.2) Alternatively: use of map function
#
# Object names (objs) are checked for their type (class) one by one
# using map function. The map function will be explained in later
# lectures. See also ?base::get and ?setNames.
objs <- paste("aml", c("sav", "tsv", "csv", "dta", "sas", "rds"), sep="_")
map(setNames(objs,objs), ~ class(get(.x)$x))
# 2)
#
# SPSS variables and variable labels can be labelled. It is possible, using
# haven R package, to create these labels manually so they can be exported
# to a file for usage in the SPSS environment if needed.
#
# The variables are labelled using the R attributes assignments:
attr(x = aml$time,which = "label") <- "survival or censoring time"
attr(x = aml$status,which = "label") <- "censoring status"
attr(x = aml$x,which = "label") <- "maintenance chemotherapy status"
# Values of a variable can be labelled with haven::labelled function:
aml[["x"]] <- labelled(aml[["x"]], c(Maintained=1, Nonmaintained=2))
write_sav(aml,"aml.sav")
aml_sav <- read_sav("aml.sav")
# 3)
#
# Read one of your own excel sheets using `readxl` R package functions,
# e.g. read_xlsx. If you do not have an excel sheet then it is a good
# practice to create an excel sheet manually and try to read it into R
# environment.
#Descriptive statistics: (group) summaries (lecture) 👨🏫
Files: pulseNA.csv.
Needs: “pulseNA.csv”, “survey.csv”, 🔴tidyverse.
Gives: drop_na.
Often we are interested count, mean, sum, variance, standard deviation etc. of a variable. In the context of tidyverse this is achieved by the function summarise.
Prior to summarise you may be interested in a summary based on groups in the dataset. The groups may be constructed from one or more categorical variables. This is acheived by group_by function.
# group/summarise : mean age per exercise group
#
pulse |>
group_by(exercise) |>
summarise(mean_age=mean(age))
# handling missing values : is.na/drop_na
#
pulse |>
filter(!is.na(exercise)) |>
group_by(exercise) |>
summarise(mean_age=mean(age, na.rm=TRUE))
# drop all observations with missing value(s)
#
pulse |>
drop_na() |>
nrow()
# drop observations with missing gender value only
#
pulse |>
drop_na(gender) |>
nrow()
# group/summarise : mean age per exercise group with missing values removed
#
pulse |>
drop_na() |>
group_by(exercise) |>
summarise(mean_age=mean(age))
# summarise vs mutate ; note that summarise drops the group variable
# but mutate does not, to drop groups use ungroup()
#
pulse |> group_by(gender) |> summarise(size=n())
pulse |> group_by(gender) |> mutate(size=n())
pulse |> group_by(gender) |> mutate(size=n()) |> ungroup()
# group_by : multiple variables
#
# Note the message: 'summarise()` has grouped output by 'gender'.
# You can override using the `.groups` argument.'
#
pulse |> drop_na() |>
group_by(gender, exercise) |>
summarise(mean_age=mean(weight))
# silent the message on 'summarise' with multiple variables with the
# argument .groups='drop'
pulse |>
drop_na() |>
group_by(gender, exercise) |>
summarise(mean_age=mean(weight), .groups = 'drop')
# counts, frequency tables
pulse |> group_by(gender) |> summarise(size=n())
pulse |> count(gender)
pulse |> count(age,gender)
# group/mutate
#
pulse |>
count(gender, exercise) |>
group_by(gender) |>
mutate(proportion_within_gender=n/sum(n)) |>
mutate(percentage_within_gender=round(proportion_within_gender*100,1)) |>
arrange(gender, desc(percentage_within_gender)) |>
ungroup()Descriptive statistics: (group) summaries (exercise) 📖
Files: pulseNA.csv.
Needs: 🔴“LcSummarise”, “pulseNA.csv”, “survey.csv”, 🔴“LcMutate”, 🔴tidyverse.
Gives: slice_head.
Per gender, calculate the mean and the standard deviation of the pulse before the exercise. Find how to perform these calculations with ignoring missing values. Name the columns
meanPulseBeforeandsdPulseBefore.How many students were there in each year of the experiment?
Per year, calculate the number of students and the number of missing values in the
exercisecolumn. Provide the results in a single table with columnsyear,studentsNum,missingExerciseNum.For each gender and
runlevels, build a table with min, median, and max of known pulses after the exercise.Give a summary of number of missing age values per exercise group.
From the pulse dataset retrieve the top 3 tallest individuals in males and in females. Has the final groups defined then remove the grouping.
Solution…
# 1
d |>
group_by( gender ) |>
summarize( meanPulseBefore=mean(pulse1, na.rm=TRUE), sdPulseBefore=sd(pulse1, na.rm=TRUE) )
# 1 : alternative solution
d |>
filter( !is.na(pulse1) ) |>
group_by( gender ) |>
summarize( meanPulseBefore=mean(pulse1), sdPulseBefore=sd(pulse1) )
# 2
d |> count( year )
# 3
d |>
group_by( year ) |>
summarize( studentsNum=n(), missingExerciseNum=sum( is.na(exercise) ) )
# 4
d |>
filter( !is.na(pulse2) ) |>
group_by( gender, ran ) |>
summarize( minPulse=min(pulse2), medianPulse=median(pulse2), maxPulse=max(pulse2) )
# 5
pulse |> group_by(exercise) |> summarise(missings=sum(is.na(age)))
# 6
pulse |> drop_na(gender, exercise) |> # remove rows with NA in gender and exercise
group_by(gender) |> arrange(height) |> slice_head(n=3) |>
arrange(exercise) |>
ungroup()Reshape tables (lecture) 👨🏫
Files: pulseNA.csv, survey.csv.
Needs: “pulseNA.csv”, “survey.csv”, 🔴tidyverse.
Two main functions to manipulate the layout of the table, pivot_longer transforms the table from wide to long format and pivot_wider does the opposite, converts the table from long to wide format.
pivot_longer function allows collapsing ‘similar’ variables into one variable while guaranteeing the data set’s consistency. For example take the variables pulse1 and pulse2 in the following subset of the pulse data set:
# A tibble: 3 × 3
name pulse1 pulse2
<chr> <dbl> <dbl>
1 Bonnie 86 88
2 Melanie 82 150
3 Consuelo 96 176
We can transform the table as such that all pulse values are under a single variable, say pulse:
# 'pulse1' and 'pulse2' will become categories in 'pulse' variable
dfLong <- pulses %>%
pivot_longer(c(pulse1, pulse2), names_to = "pulse", values_to = "level")
dfLong# A tibble: 6 × 3
name pulse level
<chr> <chr> <dbl>
1 Bonnie pulse1 86
2 Bonnie pulse2 88
3 Melanie pulse1 82
4 Melanie pulse2 150
5 Consuelo pulse1 96
6 Consuelo pulse2 176
pivot_wider function does the opposite of pivot_longer, going from long to wide format:
dfWide <- dfLong %>% pivot_wider(names_from = "pulse", values_from = "level")
dfWide # is identical to the original 'pulses' tibblePlotting
Often reshaping of the data is needed for visualisation. For example we would like to compare pulse1 and pulse2 variables with a boxplot. You may plot pulse1 and pulse2 in separate plots one after the other:
# For this illustration you'll need to install.packages("gridExtra").
#
suppressMessages(library(gridExtra))
# pulse1 boxplot
bp1 <- ggplot(pulse |> drop_na()) + # remove missing
aes(y=pulse1) +
geom_boxplot()
# pulse2 boxplot
bp2 <- ggplot(pulse |> drop_na()) + # remove missing
aes(y=pulse2) +
geom_boxplot()
grid.arrange(bp1, bp2, ncol=2)As you can see the y-scale of each plot is set dynamically and can be misleading. To resolve this, we would need a single plot with aesthetic mapping of x being the categorical variable pulse and y the pulse values as was shown above.
ggplot(pulse |>
drop_na() |> # remove missing
pivot_longer( c(pulse1,pulse2), names_to = "pulse", values_to = "level")
) +
aes(x=pulse, y=level) +
geom_boxplot()Cross tables
Another situation where pivoting plays a role is with cross-tabulation. Cross-tables are used to display the distribution of two or more categorical variables in a compact way. Take the following count table of age,gender groups:
age_sex_group_counts <- tibble(sex = c("female", "female", "female", "male", "male", "male", "male"),
age = c(18, 19, 20, 18, 19, 21, 22),
count = c(1, 1, 2, 1, 2, 2, 1) )
age_sex_group_counts# A tibble: 7 × 3
sex age count
<chr> <dbl> <dbl>
1 female 18 1
2 female 19 1
3 female 20 2
4 male 18 1
5 male 19 2
6 male 21 2
7 male 22 1
Here for each age group there are two entries, one for male and one for female, when there was count data available. A better way to represent the data is shown below where sex categories are set as separate columns:
age_sex_group_counts |> pivot_wider(names_from = sex, values_from = count)# A tibble: 5 × 3
age female male
<dbl> <dbl> <dbl>
1 18 1 1
2 19 1 2
3 20 2 NA
4 21 NA 2
5 22 NA 1
Reshape tables (exercise) 📖
Files: pulseNA.csv, survey.csv.
Needs: 🔴“LcSummarise”, “survey.csv”, “pulseNA.csv”, 🔴“LcMutate”, 🔴tidyverse.
- Produce the following count summary per
genderfrom variablesfoldandclapin thesurveydataset.
# A tibble: 2 × 4
gender left neither right
<chr> <int> <int> <int>
1 female 67 30 137
2 male 66 37 129
- From
pulsedataset reproduce the boxplot shown below.
Solution…
# 1
survey |>
select(gender,fold,clap) |>
pivot_longer(!gender, names_to = "action", values_to = "side") %>%
group_by(gender,side) |>
summarise(count=n(), .groups='drop') |>
pivot_wider(names_from = "side",values_from = "count")
# 2
ggplot(pulse %>%
drop_na() %>%
mutate(exercise=fct_relevel(exercise, c("low","moderate","high"))) %>%
pivot_longer(c(pulse1,pulse2), names_to = "pulse",values_to = "level") ) +
aes(x=exercise, y=level, color=pulse) + geom_boxplot()Real Data Analysis: Example data analysis problem (exercise) 📖
Files: rda-NL.rds, rda-UK.csv, rda-PL.tsv, rda-US.xlsx, rda-genes.csv.
Task concept
This task demonstrates issues arising during analysis of real-world data.
We have artificially generated several tables with data and we introduced various problems in them. Identify the problems and decide how to handle them (e.g. correct, ignore, …).
At the end of the course, the introduced problems will be discussed.
It is intentional, that the description is incomplete, and the task is not fully specified.
Use your knowledge, creativity and search skills to solve the task.
Example research question
Let’s assume that there are two simple research goals:
- Study the distribution of BMI in different populations.
- Find whether BMI is related to expression of some genes.
Several files with anthropometric data (known to be generated in different labs in different countries) are provided.
Moreover, there is a file with gene expression data for some of the subjects from all countries.
Deliverables:
- A single, properly merged dataset containing all provided datasets (after cleaning and unification).
- A report demonstrating the properties of the data, the distributions of BMI in different populations and the relations of BMI to expressions of each of the genes.
Click the links above (after Files:), to download the data files.
Organisation of files/directories
This task should be developed in a completely new RStudio project.
In the project directory:
- There should be a subdirectory called
data:- All input files with data should be placed in this directory.
- The files should never be modified in any way.
- It is convenient to have a subdirectory
outs:- All automatically generated output files should be placed in this directory.
- Separate directory allows easy removal of all generated files, in case of need.
- Provide an R Markdown script producing the report and generating the combined dataset.
- Provide a
README.mdfile with a brief description of the project.
Sampling age distribution and age categories (exercise) 📖
Needs: sample.
Gives: cut, table, sample.
Practices: c, barplot, names, sample, hist, quantile, cut, table.
From the Eurostat database we extracted estimated numbers of people of each age (range 0-99) living in the Netherlands at some recent timepoint:
age2peopleNum <- c(
168269, 170352, 170475, 172564, 175965, 175404, 180139, 176908, 181784,
185590, 190635, 191482, 192005, 188439, 191716, 193571, 199395, 206526,
212170, 220073, 227052, 223141, 222346, 217582, 217364, 218670, 225921,
226586, 229437, 232336, 233373, 225765, 223129, 222512, 221707, 216407,
212812, 206718, 207274, 210873, 213552, 206261, 206100, 203072, 204502,
205704, 214821, 221789, 239466, 250796, 261082, 266183, 254782, 251611,
252467, 255486, 259306, 254400, 248828, 245545, 237918, 234688, 227404,
221405, 216249, 209604, 205025, 200158, 198908, 190078, 189246, 189135,
192126, 198903, 201625, 140595, 143352, 132521, 118605, 108982, 107050,
98198, 90976, 79680, 73223, 65613, 58404, 50597, 45188, 38009, 32204,
25186, 20334, 15452, 11867, 8684, 6348, 4353, 2924, 1837
)
age2peopleNum[[1]] # 168269 people of age 0-1
age2peopleNum[[2]] # 170352 people of age 1-2, etc.Distribution
- Produce a barplot of the
age2peopleNumdistribution of the number of people by age.
- For better visualisation, add names to the vector
age2peopleNum, so ages get visible in the barplot.
Sampling
- Find how to use the
probargument of thesamplefunction and generate a sample of 10000 ages from the distributed according toage2peopleNum. Store the sample in the variableas. - Use
histto show the histogram of the generated ages. The shape should be similar to the barplot. - What ages correspond to quantiles 0.1, 0.25, 0.5, 0.75, 0.9 of the sampled ages (hint:
quantile)?
Building age categories
- Find out how to use the
cutfunction on sampled ages to create a factor with the following age categories:infant(age below 1),child(1<=age<10),teenager(10<=age<18),adult(18<=age<67),senior(67<=age). Store the factor in the variablefs. - Use
table(fs, as)to count the number of people in each category and age: double check whether the assigned categories are correct (considerright=FALSEargument tocut).
Solution…
barplot( age2peopleNum )
ages <- 0:(length(age2peopleNum)-1)
names( age2peopleNum ) <- ages
barplot( age2peopleNum )
age2prob <- age2peopleNum / sum(age2peopleNum)
as <- sample( ages, size=10000L, replace=TRUE, prob=age2prob )
hist( as )
quantile( as, c(0.1, 0.25, 0.5, 0.75, 0.9) )
fs <- cut( as, breaks=c(0, 1, 10, 18, 67, 100), labels=c("infant", "child", "teenager", "adult", "senior"), right=FALSE )
table(fs, as)Grades report, set operations (exercise) 📖
Needs: “Vectors”.
Gives: intersect, setdiff, union, “SetOperations”.
Practices: c, mean, names, “[…]”, intersect, setdiff, union.
Learn set operations
Study the following code to understand R set operations:
Use set operations on named vectors
Alex has the following grades on several subjects:
|biology | 7.0|
|history | 6.0|
|math | 9.0|
|music | 7.5|
|physics | 8.0|
Hai, among other subjects, has taken similar subjects as Alex:
|biology | 8.0|
|computing | 9.0|
|english | 7.5|
|history | 7.0|
|math | 9.0|
Create the named numerical vectors gradesAlex and gradesHai from the tables above.
- Print both vectors (so the subjects and grades are visible).
- For each student what is the mean grade over subjects common to both?
- What subjects has Alex taken that Hai has not and vice versa?
- Show the distinct set of subjects taken by Alex and Hai.
Solution…
gradesAlex <- c(biology=7.0, history=6.0, math=9.0, music=7.5, physics=8.0)
gradesHai <- c(biology=8.0, computing=9.0, english=7.5, history=7.0, math=9.0)
gradesAlex
gradesHai
# Manually it could be done as follows:
common_subjects <- c("biology", "history", "math")
# But we prefer use of 'intersect':
common_subjects <- intersect(names(gradesAlex),names(gradesHai))
mean(gradesAlex[common_subjects])
mean(gradesHai[common_subjects])
setdiff(names(gradesAlex),names(gradesHai))
setdiff(names(gradesHai),names(gradesAlex))
union(names(gradesAlex),names(gradesHai))Data structures 2/2, Flow control
Lists (lecture) 👨🏫
Needs: “Vectors”, length, str, names.
Gives: “Lists”, list, “[[…]]”, names, length, str, “NULL”.
A list can be thought of as zero or more data elements (of any type), kept in order.
The elements can be accessed by their position (or index): first element, second element, …
Typically the elements are also named and consequently can be also accessed by the names.
A list allows to keep together related information.
For example, functions often return lists when their result consists of many different parts.
Create a list
Type the following code to manualy create a list and store it in a variable person:
List properties
The function str (structure) gives a compact display of data stored in a list:
str( person )List of 3
$ name : chr "Bob"
$ age : num 44
$ children: chr [1:3] "Amy" "Dan" "Eve"
Use length to find the number of elements in a list:
length( person )[1] 3
To get the names of the elements of a list, use the names function:
names( person )[1] "name" "age" "children"
Single element by position
An elements of a list can be accessed by its position (first, second, …) with help of [[...]] (double bracket) operator.
Try the following to get the first and the last elements:
You may also set the element by position:
person[[1]] <- "Alice"Observe, that the elements of the person list returned by [[...]] are of different classes:
Single element by name
Each of the names can be used with the [[...]] operator to get/set an element by name:
person[[ 'name' ]][1] "Alice"
person[[ 'name' ]] <- "Hans-Peter"
person[[ 'name' ]][1] "Hans-Peter"
person[[ 'children' ]][1] "Amy" "Dan" "Eve"
Subsetting a list, many elements
Use the single bracket [...] operator to subset a list.
Here, not an element is selected, but a new list is created with selected elements:
person[ "name" ] # a list with one element$name
[1] "Hans-Peter"
selElems <- c( "name", "children" )
person[ selElems ] # a list with two elements$name
[1] "Hans-Peter"
$children
[1] "Amy" "Dan" "Eve"
Adding and removing elements
Add a new element as follow:
person[[ 'smoker' ]] <- TRUE
person$name
[1] "Hans-Peter"
$age
[1] 44
$children
[1] "Amy" "Dan" "Eve"
$smoker
[1] TRUE
Or, change the element:
person[[ 'smoker' ]] <- FALSE
person$name
[1] "Hans-Peter"
$age
[1] 44
$children
[1] "Amy" "Dan" "Eve"
$smoker
[1] FALSE
To remove an element from the list, assign NULL (nothing) to it:
person[[ 'smoker' ]] <- NULL
person$name
[1] "Hans-Peter"
$age
[1] 44
$children
[1] "Amy" "Dan" "Eve"
Note what happens when you try to access a not present element:
person[[ 'isMarried' ]]NULL
Grades of students (exercise) 📖
Needs: “Lists”, “Vectors”, NA, length, min, mean, “[…]”, “[[…]]”, “NULL”.
Practices: length, “[…]”, “[[…]]”, min, mean, “NULL”, list.
There are four students: Alice, Bob, Charlie, and Daisy. Each student has several grades, and the number of grades is different for each student:
- Alice: 8, 9, 7.5, 8, 9
- Bob: 6, 7, 8, 7, 6, 5, 5, 8
- Charlie: 9, 7, 9
- Daisy: 8, ?, 6.5, 6.5, 7, 8, 7 (one grade is missing)
Create and inspect the list
Build a list grades with the grades of the students, then inspect it:
- How many students are in the list?
- What are the names of the students?
- Retrieve Bob’s grades.
- How many grades does Charlie have?
- What are the mean grades of Bob and Daisy?
- What is Alice’s lowest grade?
- Alice and Daisy work in a group “A” and Bob and Charlie in a group “B”. Create separate lists for these groups in variables
groupAandgroupB.
Solution…
grades <- list(
Alice = c( 8, 9, 7.5, 8, 9 ),
Bob = c( 6, 7, 8, 7, 6, 5, 5, 8 ),
Charlie = c( 9, 7, 9 ),
Daisy = c( 8, NA, 6.5, 6.5, 7, 8, 7 )
)
length(grades)
names(grades)
grades[['Bob']]
length(grades[['Charlie']])
mean(grades[['Bob']], na.rm=TRUE)
mean(grades[['Daisy']], na.rm=TRUE)
min(grades[['Alice']], na.rm=TRUE)
groupA <- grades[c('Alice', 'Daisy')]
groupB <- grades[c('Bob', 'Charlie')]Modify the list
- Add Eve with grades: 7.7, 8.5, 9.
- Add Charlie’s new grades: 8.5 and 9.
- Bob has moved to another school. Remove Bob.
- The missing grade of Daisy has been found. Replace the missing with the new grade 7.
T-test returns a list (exercise) 📖
Needs: rnorm, mean, sd, “Lists”.
Gives: t.test.
Practices: rnorm, mean, sd, t.test, names.
The t-test is a statistical test used to determine if there is a significant difference between the means of two samples. (Here we will use the two-sample t-test, which is used to compare the means of two independent samples. There are some assumptions about the samples which must be met, but we do not discuss them here.)
Generate two samples from the same normal distribution
- Generate two vectors
xandysampled from the same normal distribution (rnormwith the same mean and the same standard deviation). The vectors should havenelements each (n=30). - Check the
meanandsd(standard deviation) of the generated samples.
Perform the t-test
- Find out how to use
t.testto perform the t-test on thexandyvectors. Store the result in a variableh. - The
t.testfunction returns a list with the results of the test. Investigate the list (usestrand/ornames) and extract the differenceestimateof the means and thep.value. - Interpret the
p.value. Since thexandysamples originate from the same distribution, in most cases the t-test should not reject the null hypothesis that the means are equal. Does the observedp.valueconfirm this?
Repeat sampling and testing
- Repeat manually all steps several times. Will the
p.valuebe always above 0.05 or you observe a false positive?
Generate two samples from distributions with different means
- Finally, generate the
xandyvectors with clearly different means (but the same standard deviation). - Repeat the t-test. Is the result significant or you observe a false negative?
Matrices (lecture) 👩🏫
Needs: “Vectors”, length, str, names.
Gives: “Matrices”, matrix, “[.,.]”, nrow, ncol, dim, colnames, rownames, rowMeans, rowSums, colMeans, colSums, t, “%*%“.
A matrix is a two-dimensional, rectangular structure with all elements of the same type.
The elements can be accessed by their two-dimensional position: [row, column].
The rows and columns of the matrix might additionally be named and consequently can be also accessed by a pair [rowname, colname].
Create a matrix
A matrix can be manually created from a vector.
Depending on the arguments, elements of the vector are put into the matrix in a different order (column-by-column or row-by-row):
m <- matrix( 1:6, nrow = 2 ) # column-by-column
m [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
m <- matrix( 1:6, nrow = 2, byrow = TRUE ) # row-by-row
m [,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Test the class of m:
class( m )[1] "matrix" "array"
Try the function str to get a compact summary of matrix content:
str( m ) int [1:2, 1:3] 1 4 2 5 3 6
Dimensions
There are several functions to get dimensions of a matrix.
Number of columns is provided by:
ncol( m )[1] 3
Number of rows is provided by:
nrow( m )[1] 2
Dimensions (two elements: number of rows and number of columns):
dim( m )[1] 2 3
Names of the columns and rows
colnames( m ) and rownames( m ) are used to set and get the vectors with columns and rows names.
Try to set the names:
m <- matrix( 1:6, nrow = 3 )
m [,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
A B
[1,] 1 4
[2,] 2 5
[3,] 3 6
A B
X 1 4
Y 2 5
Z 3 6
To get the names use:
Accessing matrix elements
Use single brackets notation with two arguments [ row(s), col(s) ] to get/set elements.
Note, when getting only a single row or a single column the result is returned as a vector, not as a matrix.
m[ 3, 1 ] # a single element as a vector of length 1[1] 3
m[ 3, ] # a single row as a vectorA B
3 6
m[ , 1 ] # a single column as a vectorX Y Z
1 2 3
m[ c( 2, 3 ), 1 ] # two rows/one column, still a vectorY Z
2 3
B A
Y 5 2
Z 6 3
B A
Y 5 2
Z 6 3
Notice the difference in the output class when only a single element is requested:
With the drop argument you may enforce that always a matrix is returned (never a vector):
m[ 3, 1 ] # a vector[1] 3
m[ 3, 1, drop = FALSE ] # a matrix A
Z 3
class( m[ 3, 1, drop = FALSE ] )[1] "matrix" "array"
Empty row/column argument field means all rows and/or all columns:
Useful matrix functions
Short summary of matrix operations: http://www.statmethods.net/advstats/matrix.html
Transposition
Matrix multiplication
Element-wise multiplication
m A B
X 1 4
Y 2 5
Z 3 6
m+1 A B
X 2 5
Y 3 6
Z 4 7
m * (m+1) A B
X 2 20
Y 6 30
Z 12 42
Row/columns means/sums
Correlation heatmaps (exercise) 📖
Needs: “Matrices”.
Gives: cor, cbind, heatmap.
Practices: rnorm, cor, cbind, heatmap.
Matrix from vectors; matrix heatmap
Let’s merge x1, x2 and x3 column-wise into a matrix using m <- cbind( x1, x2, x3 ).
Check the class of m.
Show the first few rows of m.
Try a simple heatmap visualisation of the matrix: heatmap( m, Colv = NA, Rowv = NA, scale = "none" ).
How are the vectors shown?
Which colors correspond to the lowest/highest matrix values?
Do the vectors appear correlated?
Matrix of correlations
Use cc <- cor( m ) to build the matrix of correlation coefficients between the columns of m.
Use round( cc, 3 ) to show this matrix with 3 digits precision.
Can you interpret the values?
Here is an example, how the heatmap of correlations could look like (of course, for your randomly sampled data it will look slighly different):
Solution…
size <- 12
# --- x and y have correlation around 0 ---
x <- rnorm( size )
y <- rnorm( size )
# --- x1, x2, x3 are correlated to each other ---
x1 <- x + rnorm( size )/10
x2 <- x + rnorm( size )/10
x3 <- x + rnorm( size )/10
# --- y1...y4 are correlated to each other ---
y1 <- y + rnorm( size )/10
y2 <- y + rnorm( size )/10
y3 <- y + rnorm( size )/10
y4 <- y + rnorm( size )/10
# --- matrix of all vectors ---
m <- cbind( y4, y3, x2, y1, x1, x3, y2 )
# --- calculating correlations of all matrix column pairs ---
cc <- cor( m )
# --- visualisation of correlations ---
heatmap( cc, symm = TRUE, scale = "none" )Solution…
# E.g. value for (row: x1, col: y1) is the corerlation of vectors x1, y1.
# Values of 1.0 are on the diagonal: e.g. x1 is perfectly correlated with x1.
# Correlations between x, x vectors are close to 1.0.
# Correlations between y, y vectors are close to 1.0.
# Correlations between x, y vectors are close to 0.0.Modelling weight from height, linear model (exercise) 📖
Needs: “pulseNA.csv”.
Gives: lm, summary, “Formulas”, plot, abline, “~”.
Practices: lm, summary, plot, abline, names, “~”.
Formula objects
Let’s assume that it is acceptable to use a linear model to model the weight of a person based on their height. R uses the formula notation to express the relationship between variables: weight ~ height. The formula is an abstract representation of the relationship between the variables. No calculations is performed at this stage.
height ~ weight # height depends on weightheight ~ weight
height ~ weight + gender # height depends on weight and genderheight ~ weight + gender
form <- weight ~ height # formulas are objects; can be stored in variables
formweight ~ height
Fitting a linear model: weight ~ height
- Load the
pulseNA.csvdataset. - Plot the
weightagainst theheightusing theplotfunction. - Fit out how to fit a linear model to the data with the
lmfunction. Use the formulaweight ~ height. Store the result in a variableh. Observe what is printed when you print thehobject. - Use the
summaryfunction to get a further summary of the linear model. Store the result in a variablesumH. Print it. - Plot again, use
plot( weight ~ height, data = d ) + abline(h, col="red"). - Treat
sumHas a list and find out what are the names of the elements of thesumHobject. Print thecoefficientsandadj.r.squaredelements. Compare their values with the printed output ofhandsumH. - Can you interpret the result?
Flow control: if/else, for (lecture) 👩🏫
Needs: “Vectors”, sample, intersect, names, sort.
Gives: “if”, “else”, “for”, “break”, “next”, seq_along, seq_len.
Conditional execution: if, else
The if statement allows to execute a block of code only if a condition is met. After if there can be an optional else statement with another block of code to execute if the if condition has not been met.
Let’s consider a simple example of a lottery bet:
myBet <- c( 1, 24, 26, 32, 36, 40 ) # 6 numbers from 1 to 45
lotteryDraw <- sort( sample( 1:45, 6 ) ) # lottery draw
lotteryDraw[1] 1 10 18 25 36 37
intersect( myBet, lotteryDraw ) # matched numbers[1] 1 36
[1] 2
The following code prints some text only if at least one number matched:
if( num > 0 ) {
print( "You matched at least one!" )
}[1] "You matched at least one!"
Additionally, a different text is printed if no number matched:
[1] "You matched at least one!"
The following code prints different text depending on the number of matched numbers:
Looping: for
A for loop repeats a block of code for each element of an iterable object (e.g. vector or list), in the same order.
A typical use-case is to loop over elements of a vector (or a list):
vs <- c( a=37, b=44, c=21, d=63, e=15 )
sum <- 0
for( v in vs ) { # loop over elements of vs; works always
sum <- sum + v # exectued many times for v being each element of vs
}
sum[1] 180
Another way is to loop over indices of a vector (or a list). Use seq_along to get a sequence of indices (it works correctly for the empty vector/list):
sum <- 0
for( i in seq_along(vs) ) { # loop over indices of vs; works always
sum <- sum + vs[[i]] # executed many times for i being each index of vs
}
sum[1] 180
For objects with named elements, it is possible to loop over names:
sum <- 0
for( n in names(vs) ) { # loop over names of vs elements; fails if no names
sum <- sum + vs[[n]] # executed many times for n being each name of vs
}
sum[1] 180
To simply repeat a block of code a fixed number of times, use seq_len:
repeatNum <- 4 # seq_len(4) is c(1,2,3,4), seq_len(0) is empty
for( i in seq_len(repeatNum) ) {
print( i ) # 1:4 id c(1,2,3,4), but 1:0 is NOT empty!
}[1] 1
[1] 2
[1] 3
[1] 4
A loop might be broken with the break statement:
vs <- c( 18, 33, NA, 61 )
anyMissing <- FALSE
for ( v in vs ) {
if( is.na(v) ) {
anyMissing <- TRUE
print( "Missing found" )
break
}
print( v )
}[1] 18
[1] 33
[1] "Missing found"
anyMissing[1] TRUE
To skip the rest of the loop and continue with the next iteration, use the next statement:
Flow control: functions (lecture) 👩🏫
Needs: “if”, “else”, sum, exp, 🔴log, length.
Gives: “function”, return, stopifnot, all.equal.
Function definition
Functions allow to encapsulate a block of code and execute (call) it multiple times. The name of the function allows to represent the block of code with a single word. The arguments of the function allow to pass different values to the block of code. The return value of the function allows to pass a single value back to the caller.
ageGroup <- function(age) { # define function `ageGroup`, one argument `age`
if( age < 18 ) { # function body...
return( "child" ) # ...return text "child" if age is less than 18
} else if( age < 65 ) {
return( "adult" ) # ...return text "adult" if age is less than 65
} else {
return( "senior" ) # ...return text "senior" otherwise
} # ...end of function body
}
ageGroup( 12 ) # call function `ageGroup` with first argument 12[1] "child"
ageGroup( age=35 ) # call function `ageGroup` with `age` argument 35[1] "adult"
Testing functions
When we develop a function, we always first test it. The stopifnot function allows to replace manual testing and save the tests for future (when potentially the function gets incorectly modified). The stopifnot function produces an error if the expression in its argument is not TRUE:
Default arguments
A function might have default arguments. The caller can omit the value of a default argument, and the default value will be used instead. Let’s define our own function to calculate the mean of a vector of numbers. The function will have an argument isGeometric to allow to calculate the geometric mean instead of the algebraic mean:
myMean <- function( vs, isGeometric=FALSE ) {
if( isGeometric ) {
res <- exp( sum(log(vs)) / length(vs) )
} else {
res <- sum(vs)/length(vs)
}
res
}
myMean( vs=c(2,4,8), isGeometric=FALSE ) # algebraic mean[1] 4.666667
myMean( vs=c(2,4,8), isGeometric=TRUE ) # geometric mean[1] 4
myMean( vs=c(2,4,8) ) # algebraic mean, default arg[1] 4.666667
myMean( c(2,4,8), TRUE ) # arguments matched by position[1] 4
myMean( isGeometric=TRUE, vs=c(2,4,8) ) # arguments matched by name[1] 4
When writing tests for equality of numerical results, it is better to use all.equal (close numerical result up to a certain tolerance) instead of == (exact equality):
Printing patterns (exercise) 📖
Needs: “for”, “if”, “else”, seq_len.
Gives: cat, stop.
Practices: “for”, “if”, “else”, cat, stop.
Understand, how the following code draws a rectangle from * symbols:
rowsNum <- 4
colsNum <- 15
fillChar <- "*"
for( r in seq_len(rowsNum) ) {
for( c in seq_len(colsNum) ) {
cat( fillChar ) # prints a single character
}
cat( "\n" ) # prints a newline character
}***************
***************
***************
***************
Encapsulate the code in a function
Define a function catFilledRect. The function should print a rectangle as shown above. The size of the rectangle should be provided as function arguments rowsNum and colsNum. The fill character should be provided as the third argument fillChar (with a default value of *).
Below you can see some examples of the expected output.
catFilledRect( rowsNum=3, colsNum=15, fillChar="X" )XXXXXXXXXXXXXXX
XXXXXXXXXXXXXXX
XXXXXXXXXXXXXXX
catFilledRect( 4, 2 )**
**
**
**
Reuse one function in another
Write a function catSquare that prints a square of a given size. The function should use the catFilledRect function with appropriate arguments.
The function should work as follows:
Solution…
catSquare <- function( size, fillChar="*" ) {
catFilledRect( rowsNum=size, colsNum=size, fillChar=fillChar )
}catSquare( size=4, fillChar="X" ) # prints 4x4 square filled with XXXXX
XXXX
XXXX
XXXX
catSquare( size=2 ) # prints 2x2 square filled with ***
**
Add an error stop for nonsense arguments
Find out how to use stop to produce an error when a negative size is provided. Rewrite catSquare to produce the error as follows:
Solution…
catSquare <- function( size, fillChar="*" ) {
if( size < 0 ) {
stop( "Size must be non-negative!" )
}
catFilledRect( rowsNum=size, colsNum=size, fillChar=fillChar )
}catSquare( size=-1, fillChar="X" ) # we print an own error for size < 0Error in catSquare(size = -1, fillChar = "X"): Size must be non-negative!
Different shapes
Implement functions catRect, catBackslash, catSlash, and catX based on the examples shown below. The functions should print the shapes as shown. You may throw errors when arguments make no sense (not shown in the solutions).
catRect( rowsNum=5, colsNum=15, drawChar="*", fillChar="." )***************
*.............*
*.............*
*.............*
***************
catBackslash( size=6, drawChar="*", fillChar="." )*.....
.*....
..*...
...*..
....*.
.....*
catSlash( size=8, drawChar="*", fillChar="." ).......*
......*.
.....*..
....*...
...*....
..*.....
.*......
*.......
catX( size=8, drawChar="*", fillChar="." )*......*
.*....*.
..*..*..
...**...
...**...
..*..*..
.*....*.
*......*
Graphics
Introduction to ggplot2 (lecture) 👩🏫
Files: pulseNA.csv.
Needs: “tidyverse”, read_csv, “pulseNA.csv”, “|>”.
Gives: “ggplot2”, ggplot, geom_point, aes, drop_na.
The ggplot2 package (a part of the tidyverse library) is a powerful tool for creating visualizations in R. The package provides the grammar of graphics and allows to create complex plots with a simple syntax.
The main input is a table (a data.frame or a tibble) with the data to be plotted, and the main function is ggplot. A plot is built by adding layers to the plot object.
library(tidyverse) # or: library(ggplot2)
d <- read_csv( "pulseNA.csv" , show_col_types = FALSE )
plotData <- d |> drop_na( pulse1, pulse2, gender )
ggplot(plotData) +
geom_point( aes(x=pulse2, y=pulse1) )The layer provided by geom_point is used to create a scatter plot. Points in a scatter plot have coordinates given by the x and y aesthetics. Moreover, the points have color, shape, and size aesthetics. When an aesthetic is listed in the aes function, it refers to a column of the table. When it is listed outside of the aes function, it is a constant value.
Below, the x and y aesthetics are taken from the pulse1 and pulse2 columns of the table, whereas the color aesthetic is set to a constant darkgreen (see help for the function colors for other colors).
ggplot(plotData) +
geom_point( aes(x=pulse1, y=pulse2), color="darkgreen" )Below, the color aesthetic uses the ran column of the table. Some colors are automatically chosen to represent the levels of the ran column:
ggplot(plotData) +
geom_point( aes(x=pulse1, y=pulse2, color=ran) )The size, shape, alpha (transparency level in 0..1 range) are other available aesthetics (see Aesthetic specifications).
ggplot(plotData) +
geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )Customizing plot axes, colors, scales and dimensions (exercise) 📖
Files: pulseNA.csv.
Needs: “tidyverse”, “pulseNA.csv”, ggplot, geom_point, aes, “|>”, drop_na.
Gives: scale_color_manual, scale_shape_manual, xlab, ylab, ggsave, scale_color_continuous, scale_size_area, scale_x_continuous, scale_y_continuous, coord_fixed, geom_abline.
Practices: ggplot, aes, scale_color_manual, scale_shape_manual, xlab, ylab, ggsave, scale_color_continuous, scale_size_area, scale_x_continuous, scale_y_continuous, coord_fixed, geom_abline.
Categorical variables used for color and shape; axes labels
Start with a plot defined in the variable p as follows:
suppressPackageStartupMessages( library(tidyverse) )
plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE ) |>
drop_na( pulse1, pulse2, gender )
p <- ggplot(plotData) +
geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )
pFind, what needs to be further added to p to obtain the plot shown below:
- The functions
scale_color_manualandscale_shape_manualallow to customize the colors and shapes of the points used in the plot, when data is categorical. Theirvaluesargument is a named vector allowing to manualy enumerate shapes or colors (e.g.c("TRUE"="red", ...)). Thenameargument allows to specify the legend title (use\nto insert a line break). - The functions
xlabandylaballow to set the labels of the x and y axes.
Solution…
p <- p +
scale_color_manual( values=c("TRUE"="red", "FALSE"="blue"), name="Did student\nrun?" ) +
scale_shape_manual( values=c("female"=19, "male"=15), name="Gender of\nthe student" ) +
xlab("Pulse before exercise [1/min]") +
ylab("Pulse after exercise [1/min]")
pContinuous variables used for color, size, axes
Now, start with a different plot:
p <- ggplot(plotData) +
geom_point( aes(x=pulse1, y=pulse2, color=weight, size=weight), alpha=0.5 )
pFind, what needs to be further added to p to obtain the plot shown below:
- The function
scale_color_continuousallow to customize the colors representing a continuous variable. Thelowandhigharguments specify the colors for the smallest and the largest values of the variable. - The function
scale_size_areaallows to customize the sizes of the points representing a non-negative continuous variable. Themax_sizeargument specifies the size of the largest point. - The functions
scale_x_continuousandscale_y_continuousallow to specify more details of x and y axes. Thenameargument specifies the title of the axis. Thebreaksargument specifies the positions of the ticks on the axis. - The function
coord_fixedenforces the same physical size of a unit on both axes. This is useful, when the units of the x and y axes are the same. - The
geom_abline(slope=1)function adds a diagonal line to the plot. Other arguments allow to specify thecolorandtypeof the line.
Solution…
brks <- seq(40,180,10)
p <- p +
geom_abline( slope=1, intercept=0, color="black", linetype="dashed" ) +
scale_color_continuous( low="blue", high="red", name="Weight [kg]" ) +
scale_size_area( name="Weight [kg]", max_size=5 ) +
scale_x_continuous(name="Pulse before exercise [1/min]", breaks=brks) +
scale_y_continuous(name="Pulse after exercise [1/min]", breaks=brks) +
coord_fixed()
pSaving the plot to a file, changing dimensions
Given the plot in the variable p, find out how to use ggsave to save the plot to a file in a png format. Try to save the plot in two different image sizes.
Moreover, find out what parameters of an R chunk need to be set in order to change the plot size when the plot is knitted. For example like here:
Solution…
# {r fig.width=5,fig.height=4} are the parameters of the R chunk used here
pPlot types: histogram, boxplot, barplot (exercise) 📖
Files: pulseNA.csv.
Needs: “tidyverse”, “pulseNA.csv”, ggplot, geom_point, aes, factor, ylab.
Gives: geom_histogram, geom_boxplot, geom_bar, ggtitle, scale_x_continuous, geom_jitter, theme_dark, ylab, theme_bw.
Practices: ggplot, aes, geom_histogram, geom_boxplot, geom_bar, ggtitle, scale_x_continuous, geom_jitter, theme_dark, theme_bw, drop_na, ylab, factor.
Loading the data
Load the dataset as follows. Then find out how to recreate the following plots. Pay attention to the details. The solutions should include the functions/arguments/words listed at each point.
suppressPackageStartupMessages( library(tidyverse) )
plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE)Histogram
In your solution use (among others): drop_na, geom_histogram, binwidth, boundary, fill, color, ggtitle, scale_x_continuous, breaks, seq, name. There should be no warning message.
Solution…
p <- ggplot(
plotData |> drop_na( pulse1 )
) +
geom_histogram(
aes(x=pulse1),
binwidth=20, boundary=80, fill="white", color="black"
) +
ggtitle( "Histogram of pulse before exercise" ) +
scale_x_continuous( breaks=seq(40,200,10), name="Pulse [1/min]" )
pBoxplot
Use: geom_boxplot, geom_jitter, outlier.color, theme_dark, ylab.
Solution…
p <- ggplot(plotData) +
aes( x = gender, y = weight ) +
geom_jitter( width=0.3, color = "orange" ) +
geom_boxplot( fill=NA, color="white", outlier.color=NA ) +
theme_dark() +
ylab( "Weight [kg]" )
pBarplot
Use: mutate, factor, levels, geom_bar, fill, color, alpha, theme_bw.
Plots with panels (faceting) (exercise) 📖
Needs: ggplot, aes, geom_point, 🔴facrtor, drop_na, mutate.
Gives: facet_grid, facet_wrap.
Practices: facet_grid, facet_wrap, drop_na, mutate.
Start with the following code that reads the data into plotData and produces a scatter plot of the pulse rates in p:
suppressPackageStartupMessages( library(tidyverse) )
plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE) |>
drop_na( exercise, gender, pulse1, pulse2 )
p <- ggplot(plotData) +
geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )Discover how the facet_grid and facet_wrap functions can be used to create a plot with multiple panels:
p + facet_grid( . ~ exercise )
p + facet_grid( gender ~ . )
p + facet_grid( gender ~ exercise )
p + facet_wrap( ~ year )Note, that the levels of exercise are not shown in the correct order. To correct that, change the plotData table: convert exercise to a factor with levels set to c("low","moderate","high"). Then, recreate the plots with exercise variable.
Solution…
p + facet_grid( . ~ exercise ) # exercise horizontally
p + facet_grid( gender ~ . ) # gender vertically
p + facet_grid( gender ~ exercise ) # gender vertically, exercise horizontally
p + facet_wrap( ~ year ) # year in multiple rows (like writing text)
plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE) |>
drop_na( exercise, gender ) |>
mutate( exercise = factor( exercise, levels=c("low","moderate","high") ) )
p <- ggplot(plotData) +
geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )
p + facet_grid( gender ~ exercise )Retrieving JSON weather list (exercise) 📖
Needs: “Lists”.
Gives: as.POSIXct, “epoch”, plot, “fromJSON”, “rjson”.
Practices: str, plot, “fromJSON”, as.POSIXct.
Weather data source and format
Let’s analyse the weather data from publicly available Open-meteo archive. The data can be obtained in the JSON file format (in order to see the structure of a JSON file, copy/paste to a browser tab the link provided below in the url variable).
Retrieving the data
Install the rjson package when necessary. Then, use the code below to download the weather data for Leiden:
Study the structure of the list
- Investigate the
urlvariable. Note that in a certain notation the geographicallatitudeandlongitudeof Leiden are provided. Moreover, the period of time is set from2024-07-01to2024-07-31. Finally, the type of data to be downloaded is specified:hourlytemperature and relative humidity at2mabove the ground. The timezone is set toEurope/Berlin. Look at Historical Weather API if you want to find more details. - Investigate and understand the structure of the
lsobject. - Find a way to extract the numerical vector of temperatures.
- Find the temperature unit in another element of the
lslist. - Note, that the time is provided in the Unix
epochformat (numbers like1719784800...are seconds since1970-01-01 00:00:00 UTC). Extract the vector of times and find a way to convert the time to the human-readable format (as.POSIXct). - Finally, plot the temperature against time using the basic
plotfunction (find out how to draw lines, not points).
Solution…
str(ls)
ls$hourly$temperature_2m
ls$hourly_units$temperature_2m
as.POSIXct(ls$hourly$time)
plot(as.POSIXct(ls$hourly$time), ls$hourly$temperature_2m, type="l")A function to retrieve weather data (exercise) 📖
Needs: “function”, “fromJSON”, as.POSIXct, tibble, stopifnot.
Gives: paste0, is.data.frame, is.numeric.
Practices: “function”, “fromJSON”, as.POSIXct, tibble, paste0, stopifnot, is.data.frame, is.numeric.
Prerequisite
First, solve the exercise “Retrieving JSON weather list”.
Build the function
Build a function getWeatherData which downloads the weather data for a given location and time period. The function should take the following arguments: long (longitude), lat (latitude), startDate, and endDate. The function should return a tibble with columns: time, temperature, and humidity.
Hint: To build url use the paste0 function to merge fixed parts with the parameters provided in long, lat, startDate and endDate.
Solution…
getWeatherData <- function( long, lat, startDate, endDate ) {
url <- paste0(
"https://archive-api.open-meteo.com/v1/archive?",
"latitude=", lat,
"&longitude=", long,
"&start_date=", startDate,
"&end_date=", endDate,
"&hourly=temperature_2m,relative_humidity_2m&timeformat=unixtime&timezone=Europe%2FBerlin"
)
#message( "Accessing URL: ", url )
ls <- rjson::fromJSON( file = url )
tibble(
time = as.POSIXct(ls$hourly$time),
temperature = ls$hourly$temperature_2m,
humidity = ls$hourly$relative_humidity_2m
)
}Here is an example of the function usage and a fragment of the expected result:
d <- getWeatherData( long=4.4931, lat=52.1583, startDate="2024-07-01", endDate="2024-07-31" )
head(d)# A tibble: 6 × 3
time temperature humidity
<dttm> <dbl> <dbl>
1 2024-07-01 01:00:00 13.4 82
2 2024-07-01 02:00:00 12.8 83
3 2024-07-01 03:00:00 13.2 83
4 2024-07-01 04:00:00 13.1 84
5 2024-07-01 05:00:00 13.2 83
6 2024-07-01 06:00:00 13.2 84
Implement some tests
You see by eye that the result in d is somehow fine. However, in the future you may be improving the function and accidentally you may destroy the output. To prevent this, implement some tests to check whether the function works correctly.
Use the stopifnot function to check whether:
- The output is a data frame (
is.data.frame; a tibble is a data frame). - The output has the columns
time,temperature, andhumidity. - The columns
temperatureandhumidityare numeric (is.numeric).
Solution…
d <- getWeatherData( long=4.4931, lat=52.1583, startDate="2024-07-01", endDate="2024-07-31" )
stopifnot( is.data.frame(d) )
stopifnot( all( c("time", "temperature", "humidity") %in% names(d) ) )
stopifnot( is.numeric(d$temperature) )
stopifnot( is.numeric(d$humidity) )Plot weather data for several cities (exercise) 📖
Needs: ggplot, aes, facet_grid, theme_bw, scale_color_continuous, xlab, ylab, 🔴lapply, “function”.
Gives: geom_line, bind_rows.
Practices: “function”, 🔴lapply, bind_rows, ggplot, aes, geom_line, facet_grid, theme_bw, scale_color_continuous, xlab, ylab.
Prerequisite
First, solve the exercise “A function to retrieve weather data”. Extract the function getWeatherData from the solution and use it in the code below.
Goal
The goal is to produce the function plotWeather, which produces a plot of the temperature and humidity for several cities as shown below:
- The function should take an argument
city2coords: a list providing names and geographical coordinates of the cities (see the example). - The
startDateandendDatearguments should specify the time period for which the data should be retrieved. - The function should return the plot. Use
geom_line. The order of the cities in the plot should be the same as in thecity2coordslist. - The
getWeatherDatafunction should be used to retrieve the data for each city. - Use the
lapplyfunction to iterate over the cities and retrieve the weather data as a list of tibbles. Put this list as the argument to thebind_rowsfunction to merge the tibbles into one large tibble for all cities. Use the.idargument ofbind_rowsto add a column with thecitynames.
The plot
The following code is expected to work and reproduce the shown plot. Once your implementation is ready, try it with two additional cities of your choice.
Solution…
plotWeather <- function( city2coords, startDate, endDate ) {
city2weatherData <- lapply( city2coords, function( coords ) {
getWeatherData(
long=coords$long, lat=coords$lat,
startDate=startDate, endDate=endDate
)
} )
d <- bind_rows( city2weatherData, .id="city" )
ggplot( d, aes(x=time, y=temperature, color=humidity ) ) +
geom_line() +
facet_grid( city ~ . ) +
theme_bw() +
scale_color_continuous(
low="yellow", high="blue", name="Humidity\n[rel%]"
) +
xlab( "Date/time" ) +
ylab( "Temperature [°C]" )
}Real Data Analysis: Example data analysis problem (cont.) (exercise) 📖
Files: rda-NL.rds, rda-UK.csv, rda-PL.tsv, rda-US.xlsx, rda-genes.csv.
Continue with the task provided before.
Data manipulation 3/3
Map and user-defined functions (lecture) 👩🏫
1. User-defined Function
A user-defined function is a custom function created by the user to perform specific tasks that may not be available in R’s built-in functions. User-defined functions are reusable and are called by their given name. They lead to a more organised code and are easier to maintain.
# Add 1 to the vector x
add_one <- function(x) {
# function body
x + 1
}# function call
add_one(2:3)[1] 3 4
2. Anonymous Functions
-
What is an Anonymous Function?
- An anonymous function (also called a “lambda” function) is a function without a name.
- They are useful for small, one-time tasks.
-
Syntax in R:
- Anonymous functions in R use
functionwithout assigning it to a variable. - Example:
function(x) x + 1
- Anonymous functions in R use
-
When to Use Anonymous Functions:
- For simple operations in single-use scenarios.
- Inside other functions (e.g., mapping functions).
- They make code shorter and more readable.
**Basic Syntax* for example, take the function definition of add_one without naming it and pass arguments to it in composition:
(function(x) x + 1)(2:3)[1] 3 4
Anonymous functions are best for short and straightforward tasks where iteration is involved.
This is best shown in the next section with the purrr::map family of functions.
3. tidyverse purrr map and pmap functions
map(.x, .f, ...) is part of purrr, a sub-package of tidyverse. It applies function .f to each element of a list or vector .x and returns a list.
Example: doubling numbers in a vector with an anonymous function.
map(1:5, function(x) x * 2)[[1]]
[1] 2
[[2]]
[1] 4
[[3]]
[1] 6
[[4]]
[1] 8
[[5]]
[1] 10
pmap is the ‘parallelised’ version of map supporting functions (.f) with more than 1 argument. It applies a function to elements across multiple lists or columns in a data frame. The notion parallel is used to signify element-wise processing.
Example: adding numbers from three vectors:
The map and pmap functions have specialised versions to return vectors of atomic types.They are suffixed with _int, _dbl, _lgl etc. for integer, double and logical respectively as returned types.
4. Anonymous functions, shorthand with ’' the ~ notation
map example:
Here, .x represents the argument to the anonymous function.
pmap example:
pmap_dbl(list(1:5, 6:10, 11:15), ~ .x) # 1st input
pmap_dbl(list(1:5, 6:10, 11:15), ~ .x + .y) # parallel sum of 1st and 2nd element
pmap_dbl(list(1:5, 6:10, 11:15), ~ .x + .y + .z) # error: only .x and .y are defined.
pmap_dbl(list(1:5, 6:10, 11:15), ~ ..1 + ..2 + ..3) # NOT RECOMMANDED
pmap_dbl(list(1:5, 6:10, 11:15), function(a,b,c) a+b+c) # anonymous function
pmap_dbl(list(1:5, 6:10, 11:15), \(a,b,c) a+b+c) # '\' shorthand for 'function' Map (exercise) 📖
- Add a new column
row_sumto the tabledholding the sum of all values in each row.
d <- tibble(a = 1:3, b = 4:6, c = 7:9)-
mapvspmap
Take the vector v and tibble t:
v <- 1:4
t <- tibble(a = 1:4, b = 5:8)Which of the following statements are true?
-
map(v, ~ .x * 2)andpmap(v, ~ .x * 2)will give the same output.
-
pmap(t, sum)will compute the sum ofaandbfor each row indf.
-
map(t, sum)is equivalent topmap(t, sum)when applied to a data frame.
-
map()always returns a list, whilepmap()returns a tibble by default.
In msleep mammals sleep dataset (tidyverse), find out which variables have missing values?
Consider a simple list of grades:
- Find out how to use the lapply function to calculate the mean of the grades of each student.
- Find out how in lapply to pass additional arguments for the mean function, so you can calculate the mean omitting the missing values.
- Replace lapply with sapply. Observe how the type of the result changed.
- Calculate the number of grades of each student (including the missing ones). This should be a named numerical vector.
- Produce a numerical vector with counts of missing values for each student. Hint: use apply functions twice (with is.na and sum).
- Produce a logical telling whether a name has any missing grades. Hint: understand functions any or all.
- Reorder the list of grades so, that the names are in the alphabetical order, and the grades are sorted from the largest to the smallest. Don’t drop NA values (just keep them at the end).
- …
Solution…
# 1)
d |> mutate(row_sum = pmap_dbl(d, function(a, b, c) a + b + c))
# 2) b
# 3)
map_lgl(msleep, function(x) any(is.na(x))) # alternatively use anyNA (see ?NA)
# 4)
grades <- list(
Bob = c( 6, 7, NA, 7, NA, 5, 5, 8 ),
Alice = c( 8, 9, 7.5, 8, 9 ),
Daisy = c( 8, NA, 6.5, 6.5, 7, 8, 7 ),
Charlie = c( 9, 7, 9 ),
Eve = c( 5.5 )
)
# 4.a)
map(grades, mean)
# 4.b)
map(grades, mean, na.rm=TRUE)
# 4.c)
map_dbl(grades, mean, na.rm=TRUE)
# 4.d)
map_dbl(grades, length)
# 4.e)
map_dbl(map(grades,is.na), sum)
# 4.f)
map_lgl(map(grades,is.na), any)
# 4.g)
map(grades[sort(names(grades))], sort, decreasing=TRUE, na.last=TRUE )
# 5)Iterating over list elements with lapply and sapply (exercise) 📖
Needs: list, mean, length, sort.
Gives: lapply, sapply, any.
Practices: lapply, sapply, any.
Base-R provides mechanisms to apply a function to each element of a list (or another iterable object, e.g. a vector) and automatically build a list with results. If the iterable object has names, they are copied to the output list.
Example list
Let’s consider a simple list of grades:
Calculations
- Find out how to use the
lapplyfunction to calculate themeanof the grades of each student. - Find out how in
lapplyto pass additional arguments for themeanfunction, so you can calculate the mean omitting the missing values. - Replace
lapplywithsapply. Observe how the type of the result changed. - Calculate the number of grades of each student (including the missing ones). This should be a named numerical vector.
- Produce a numerical vector with counts of missing values for each student. Hint: use apply functions twice (with
is.naandsum). - Produce a logical telling whether a name has any missing grades. Hint: understand functions
anyorall. - Reorder the list of grades so, that the names are in the alphabetical order, and the grades are sorted from the largest to the smallest. Don’t drop
NAvalues (just keep them at the end).
Solution…
lapply( grades, mean )
lapply( grades, mean, na.rm = TRUE ) # additional arguments to mean
sapply( grades, mean, na.rm = TRUE )
sapply( grades, length )
sapply( lapply( grades, is.na ), sum )
sapply( lapply( grades, is.na ), any )
n2g <- lapply( grades, sort, decreasing=TRUE, na.last=TRUE )
n2g[ sort( names( n2g ) ) ]Merge tables (lecture) 👩🏫
*_join(...) is a family of functions for combining two tibbles on common variable(s) called key.
For example, take a group of people (n=7) for which we have collected age and height data:
# height
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
height = c(160, 172, 182, 157, 162))
df1# A tibble: 5 × 2
name height
<chr> <dbl>
1 Isa 160
2 Jaylinn 172
3 Mila 182
4 Milas 157
5 Yara 162
# A tibble: 5 × 2
name age
<chr> <dbl>
1 Fiene 20
2 Jaylinn 24
3 Mila 17
4 Noah 23
5 Yara 17
The information here is incomplete, for some we have only height and for some only age. Now we want to combine these tibbles into a single table with variables name, age and height. We can do this with a join function which is able to combine observations by matching on the common variable(s) between the two tibbles also known as keys. Matching is done by one or more variables, in this case the variable name.
Note that for a proper joining of data, the matched variable(s), also known as the key, must be unique in both tibbles.
inner_join Return all rows in df1 where there are matching values of name in df2 and all columns in both df1 and df2:
inner_join(df1,df2, by = "name")# A tibble: 3 × 3
name height age
<chr> <dbl> <dbl>
1 Jaylinn 172 24
2 Mila 182 17
3 Yara 162 17
left_join Return all rows from df1 and all columns in both df1 and df2, NA for missing values in df2:
left_join(df1,df2, by = "name")# A tibble: 5 × 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
full_join All rows and all columns in df1 and df2.
full_join(df1,df2, by = "name")# A tibble: 7 × 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
6 Fiene NA 20
7 Noah NA 23
Bind tables (lecture) 👨🏫
Often we have different data sets with i) the same set of variables or ii) same set of observations but different variables which we would like to combine:
- bind_rows(…) function creates a new tibble from a set of tibbles with not necessarily common variables.
- treats tibbles as a collection of unordered variables.
- bind_cols(…) function creates a new tibble from a set of tibbles with common observations.
- expects the same number of observations in each tibble
bind_rows(t1,t2) # compatible# A tibble: 6 × 2
a b
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
bind_rows(t1,t3) # incompatible# A tibble: 6 × 2
a b
<int> <chr>
1 1 a
2 2 b
3 3 c
4 6 <NA>
5 7 <NA>
6 8 <NA>
To combine variables column-wise use bind_cols:
Note that the user is responsible for the order of observations in each tibble.
Merge tables (exercise) 📖
Files: pulseNA.csv, survey.csv.
Needs: 🔴“LcJoin”, “survey.csv”, “pulseNA.csv”, 🔴“LcMutate”, 🔴tidyverse.
- Unique keys To consistently merge tables with the given key(s) it is important to make sure that the chosen key(s) contain no duplicates.
1.1) Take a look at the function
distinct(?distinct) and use it to check whether the variablenamein the pulse dataset has no duplicates and each name can uniquely identify an observation(row) in the table.1.2) Do the same test as in 1.1 with the function
duplicated(?duplicated)
The variable
namein thesurveydataset has duplicates, can you find other variables that would make it distinct when combined withname?Print the rows with duplicate names in
pulseandsurveydataset.
Solution…
# 1
#
# 1.1
#
# There are less distinct names than total nr. of observations, therefore some
# names must be duplicated.
#
check_names <- list( nr_of_distinct_names = pulse |> distinct(name) |> nrow(),
nr_of_names = pulse |> nrow())
# Test equality: if FALSE then there are duplicates otherwise not.
check_names$nr_of_distinct_names == check_names$nr_of_names
# 1.2
#
pulse |> summarise(nr_of_distinct_names = sum( ! duplicated(name)), nr_of_names = n() )
# 2
(survey |> distinct(name,height) |> nrow() ) == (survey |> nrow())
(survey |> distinct(name,span1) |> nrow() ) == (survey |> nrow())
(survey |> distinct(name,age) |> nrow() ) == (survey |> nrow())
# 3
# pulse
pulse_dupl_names <- pulse |> filter(duplicated(name)) |> pull(name)
pulse |> filter(name %in% pulse_dupl_names) |> arrange(name)
# survey
survey_dupl_names <- survey |> filter(duplicated(name)) |> pull(name)
survey |> filter(name %in% survey_dupl_names) |> arrange(name)Bind tables (exercise) 📖
- In the survey dataset what is the mean height of the first 15 and the 30 observations combined?
- Convert the list below to a tibble with variables
nameandheight:
heights <- list(Andre=182.88, Benjamin=175.26, Alyson=173, Edward=175)- How many duplications of (
name,height) do you find in the combinedsurveyandpulsedataset?
Solution…
#
survey <- read_csv("survey.csv", show_col_types = FALSE)
pulse <- read_csv("pulseNA.csv", show_col_types = FALSE)
# 1)
bind_rows(survey |> head(15), survey |> tail(30)) |>
summarise(mean_height=mean(height, na.rm=TRUE))
# 2)
heights <- list(Andre=182.88, Benjamin=175.26, Alyson=173, Edward=175)
lapply(names(heights),
function(name_) tibble(name=name_, height=heights[[name_]])) |>
bind_rows()
# 3)
bind_rows(pulse |> select(name, height), survey |> select(name, height)) |>
duplicated() |>
sum()Date/Time (lecture) 👩🏫
lubridate is a sub-package in tidyverse for working with dates and times in R. Handling date-time data can be tricky due to different formats, time zones and lubridate simplifies it by providing functions for parsing, extracting, and manipulating dates and times.
1. Parsing Dates and Times
lubridate has the following functions to convert date-time strings into proper date or datetime objects in R:
-
ymd(): Parses dates in “Year-Month-Day” format. -
mdy(): Parses dates in “Month-Day-Year” format. -
dmy(): Parses dates in “Day-Month-Year” format.
Similarly, for date-times (dates with times included), lubridate offers:
Example:
date1 |> class()[1] "Date"
date2 <- dmy_hms("21-11-2024 11:15:00")
date2[1] "2024-11-21 11:15:00 UTC"
2. Extracting Components of Dates and Times
Once you have a date or date-time object, you often need to extract components like the year, month, or day. lubridate provides straightforward functions for this:
-
year(): Extracts the year. -
month(): Extracts the month. -
day(): Extracts the day of the month. -
hour(),minute(),second(): Extracts the hour, minute, or second from a date-time.
Example:
3. Modifying Dates and Times
You can also modify parts of a date-time object directly. For instance, if you want to change the month or add days to a date, lubridate provides functions like update() and simple arithmetic.
-
update(): Modifies specific parts of a date-time object. - Arithmetic: Adding or subtracting days, months, etc.
Example:
Note: years, months, etc. will create periods of date/time for date calculation, not to be confused with year, month, etc. which are used for extraction.
4. Working with Intervals and Durations
For date-based calculations, lubridate provides intervals, durations, and periods to help calculate differences between dates or add specific time amounts.
-
interval(): Creates an interval between two dates. -
duration(): Measures exact time in seconds (useful for precise calculations). -
period(): Measures human-readable time units like days, months, or years.
Example:
# interval between two dates
intrvl1 <- interval( dmy_hms("18-11-2024 11:15:00"), dmy_hms("22-11-2024 11:15:00"))
intrvl1[1] 2024-11-18 11:15:00 UTC--2024-11-22 11:15:00 UTC
intrvl1 |> class()[1] "Interval"
attr(,"package")
[1] "lubridate"
# duration in days
as.duration(intrvl1) /ddays(1) [1] 4
# period
as.period(intrvl1)[1] "4d 0H 0M 0S"
Summary
lubridate simplifies date-time manipulation provides:
- Easy parsing functions to convert strings to dates.
- Extraction tools for getting specific date-time components.
- Modification options to adjust dates and times.
- Intervals and durations for date calculations.
Date/Time (exercise) 📖
-
Date/Time formats Using the function
formatone can format dates/times, for example givennew_years_eveof typeDate:
new_years_eve # December 31, 2024 at 6 PM[1] "2024-12-31 18:00:00 UTC"
format(new_years_eve, "%Y-%m-%d")[1] "2024-12-31"
where %Y, %m and %d stands for year, month and day respectively. See section under ?strptime for more formatting options.
Exercise format new_years_eve in the following formats:
[1] "Tuesday, 31 December 2024"
[1] "Tue, Dec 31"
[1] "2024-12-31 18:00:00"
[1] "2024-12-31 06:00 PM"
[1] "Day 366 of the year, Week 52"stormsdataset is part of the tidyverse package. Create a new dataframestorms_dateswhere the date related variables are replaced with variabledateof typeDate.List top 3 storms with longest period in
storms_dates. Note that the storms are monitored and reported often every 6 hours but also in other intervals.
tibble [19,537 × 13] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:19537] "Amy" "Amy" "Amy" "Amy" ...
$ year : num [1:19537] 1975 1975 1975 1975 1975 ...
$ month : num [1:19537] 6 6 6 6 6 6 6 6 6 6 ...
$ day : int [1:19537] 27 27 27 27 28 28 28 28 29 29 ...
$ hour : num [1:19537] 0 6 12 18 0 6 12 18 0 6 ...
$ lat : num [1:19537] 27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
$ long : num [1:19537] -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
$ status : Factor w/ 9 levels "disturbance",..: 7 7 7 7 7 7 7 7 8 8 ...
$ category : num [1:19537] NA NA NA NA NA NA NA NA NA NA ...
$ wind : int [1:19537] 25 25 25 25 25 25 25 30 35 40 ...
$ pressure : int [1:19537] 1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
$ tropicalstorm_force_diameter: int [1:19537] NA NA NA NA NA NA NA NA NA NA ...
$ hurricane_force_diameter : int [1:19537] NA NA NA NA NA NA NA NA NA NA ...
Solution…
# 1)
new_years_eve <- dmy_hms("31/12/2024 18:00:00")
format(new_years_eve, "%A, %d %B %Y")
format(new_years_eve, "%a, %b %d")
format(new_years_eve, "%Y-%m-%d %H:%M:%S")
format(new_years_eve, "%Y-%m-%d %I:%M %p")
format(new_years_eve, "Day %j of the year, Week %U")
# 2)
storms_dates <- storms |>
mutate(date=dmy_hms(
paste(paste(day,month,year,sep="-"),
paste(hour,"00","00",sep=":"))
)
) |>
select(-year,-month,-day,-hour) |>
select(name,date,everything())
# 3)
# Note: this solution does not catch storms crossing over years.
storms_dates |>
mutate(year=year(date)) |>
group_by(name,year) |>
mutate(period=as.period(as.interval(min(date),max(date)))) |>
distinct(name,year,period) |>
ungroup() |>
arrange(desc(period)) |>
head(3)Self-Study Assignment
Real Data Analysis: Example data analysis problem (cont.) (exercise) 📖
Files: rda-NL.rds, rda-UK.csv, rda-PL.tsv, rda-US.xlsx, rda-genes.csv.
Continue with the task provided before.